Model Card for amirhossein-yousefi/speech2text-intensity-regression-wav2vec
Summary: End-to-end speech model that jointly performs automatic speech recognition (ASR) and voice intensity regression from the same input audio: Wav2Vec2‑CTC with a regression head.
Model Details
Model Description
- Developed by: Amirhossein Yousefi
- Model type: Multitask speech model (ASR + scalar intensity regression): facebook/wav2vec2-base-960h (CTC) + attention‑masked mean‑pooling regressor
- Language(s): English (depends on chosen dataset/splits)
- License: MIT
- Finetuned from: facebook/wav2vec2-base-960h
Model Sources
- Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
- Demo: Gradio script in `app/gradio_app.py`
Uses
Direct Use
- Transcribe English speech to text (ASR) and simultaneously estimate normalized intensity for the same audio clip (see the sketch after this list).
- Interactive inference via CLI or Gradio.
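The repo's CLI and Gradio app wrap all of this up. For the ASR half alone, a minimal programmatic sketch using standard `transformers` APIs might look as follows; the intensity head is repo-specific, so this only illustrates transcription, and the file path is hypothetical:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a clip as 16 kHz mono float audio (hypothetical path).
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, frames, vocab)
transcript = processor.batch_decode(logits.argmax(dim=-1))[0]
print(transcript)
```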
Downstream Use
- Domain‑specific fine‑tuning for ASR while keeping the intensity head.
- Use intensity as an auxiliary signal for VAD thresholds, diarization heuristics, or UX analytics.
Out‑of‑Scope Use
- Safety‑critical applications without human review.
- Treating the intensity output as perceptual loudness or emotion/affect; it is RMS dBFS‑derived and sensitive to mic gain/environment.
Bias, Risks, and Limitations
- Dataset bias: Default training on LibriSpeech (read audiobooks) may not reflect conversational or accented speech.
- Device & environment sensitivity: Intensity depends on microphone, distance, and preprocessing.
- Domain shift: Degradation is expected on far‑field/noisy/multilingual inputs without adaptation.
Recommendations
- Calibrate or post‑normalize intensity for your capture setup (see the sketch after this list).
- Report WER and regression errors by domain (mic type, SNR buckets, etc.). Keep a human in the loop for sensitive deployments.
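One way to calibrate, assuming you can record a few clips on your device alongside a trusted reference meter (the paired values below are made up), is a least-squares linear correction:

```python
import numpy as np

# Hypothetical paired measurements on your capture setup.
raw = np.array([0.35, 0.48, 0.61, 0.75])  # model intensity outputs
ref = np.array([0.30, 0.45, 0.62, 0.80])  # reference readings for the same clips

gain, offset = np.polyfit(raw, ref, deg=1)  # least-squares linear fit

def calibrate(intensity: float) -> float:
    """Map a raw model intensity to the calibrated scale, clipped to [0, 1]."""
    return float(np.clip(gain * intensity + offset, 0.0, 1.0))
```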
How to Get Started with the Model
Environment
```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
Train (Whisper backbone)
```bash
python -m src.speech_mtl.training.train_whisper \
  --model_name openai/whisper-small \
  --language en \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --num_train_epochs 1 \
  --output_dir outputs/whisper_small_mtl
```
Train (Wav2Vec2‑CTC backbone)
```bash
python -m src.speech_mtl.training.train_wav2vec2 \
  --model_name facebook/wav2vec2-base-960h \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --max_train_samples 1000 \
  --max_eval_samples 150 \
  --num_train_epochs 1 \
  --output_dir outputs/wav2vec2_base_mtl
```
Evaluate
```bash
python -m src.speech_mtl.eval.evaluate \
  --whisper_model_dir outputs/whisper_small_mtl \
  --wav2vec2_model_dir outputs/wav2vec2_base_mtl \
  --dataset librispeech_asr \
  --split test.clean \
  --text_column text
```
Inference (CLI)
```bash
python -m src.speech_mtl.inference.predict \
  --model whisper \
  --checkpoint outputs/whisper_small_mtl \
  --audio path/to/audio.wav
```
Gradio Demo
```bash
python app/gradio_app.py --model whisper --checkpoint outputs/whisper_small_mtl
# or
python app/gradio_app.py --model wav2vec2 --checkpoint outputs/wav2vec2_base_mtl
```
Training Details
Training Data
- Default: `librispeech_asr` (`train.clean.100`; eval on `validation.clean`/`test.clean`).
- Optional: `mozilla-foundation/common_voice_13_0` via `--dataset` and `--language`.
- Intensity targets: computed from audio RMS dBFS, bounded to [-60, 0] dBFS, then normalized to [0, 1] as `norm_intensity = (dbfs + 60) / 60` (sketched below).
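A minimal NumPy sketch of this label computation (the repo's exact implementation may differ in details such as the epsilon):

```python
import numpy as np

def intensity_target(waveform: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized intensity label from a mono float waveform in [-1, 1]."""
    rms = np.sqrt(np.mean(waveform ** 2))    # root-mean-square amplitude
    dbfs = 20.0 * np.log10(rms + eps)        # RMS level in dBFS
    dbfs = float(np.clip(dbfs, -60.0, 0.0))  # bound to [-60, 0]
    return (dbfs + 60.0) / 60.0              # normalize to [0, 1]
```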
Training Procedure
Preprocessing
- Load/resample to 16 kHz per backbone requirements.
- Compute intensity labels from raw audio; LUFS (via `pyloudnorm`) can be used as an alternative (see the sketch after this list).
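For illustration, resampling with `librosa` and the LUFS alternative via `pyloudnorm` might look like this; the file path is hypothetical, and mapping LUFS to a normalized target is left to the user:

```python
import librosa
import pyloudnorm as pyln

# Load and resample to the 16 kHz rate the Wav2Vec2 backbone expects.
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

# Alternative label: integrated loudness (LUFS) per ITU-R BS.1770.
meter = pyln.Meter(sr)
lufs = meter.integrated_loudness(audio)  # needs clips >= 0.4 s
```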
Training Hyperparameters
- Training regime: fp16 mixed precision when available; batch size and LR configured via `configs/*.yaml`.
Speeds, Sizes, Times
- Example single‑epoch fine‑tuned weights are linked in the repo README (`training-logs/` contains logs).
Evaluation
Testing Data, Factors & Metrics
- Testing Data: LibriSpeech `test.clean` by default; optionally Common Voice.
- Factors: noise level, microphone/domain, utterance length.
- Metrics:
  - ASR: Word Error Rate (WER)
  - Intensity regression: MAE, MSE, and R² (see the sketch after this list)
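A sketch of how these metrics can be computed with `jiwer` and `scikit-learn` (toy values for illustration):

```python
import jiwer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

refs = ["the cat sat on the mat", "hello world"]  # reference transcripts
hyps = ["the cat sat on a mat", "hello world"]    # model transcripts
wer = jiwer.wer(refs, hyps)                       # fraction in [0, 1]

y_true = [0.62, 0.41, 0.88, 0.15]                 # intensity labels
y_pred = [0.60, 0.45, 0.80, 0.20]                 # model predictions
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"WER={wer:.3f} MAE={mae:.3f} MSE={mse:.4f} R2={r2:.3f}")
```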
Results
📊 Training Logs & Metrics
- Total FLOPs (training): 11,971,980,681,992,470,000
- Training runtime: 9,579.8516 seconds for 3 epochs
- Logging: TensorBoard‑compatible logs in `src/checkpoint/logs`; you can monitor training live (see below).
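For example, with TensorBoard installed:

```bash
tensorboard --logdir src/checkpoint/logs
```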
✅ Full Metrics
🔎 Highlights
- Validation WER (↓): 12.897% (0.128966 as a fraction)
- Validation Loss: 21.7842
- Fast eval throughput: 17.05 samples/s • 4.264 steps/s
WER from `jiwer.wer` (fraction in [0, 1]; percent shown for readability).
This run uses a CTC objective for ASR and an auxiliary intensity head (multi‑task), but only ASR metrics were logged during evaluation.
Validation (Dev)
| Metric | Value |
|---|---|
| Loss | 21.7842 |
| WER (↓) | 0.128966 (12.897%) |
| Runtime (s) | 158.5324 (≈ 2m 39s) |
| Samples / s | 17.050 |
| Steps / s | 4.264 |
| Epoch | 2.8 |
Training Summary
| Metric | Value |
|---|---|
| Train Loss | 227.4951 |
| Runtime (s) | 9,579.8514 (≈ 2h 39m 40s) |
| Samples / s | 8.937 |
| Steps / s | 0.559 |
| Epochs | 3.0 |
Summary
Multitask objective = ASR loss + intensity regression loss (weight controlled by `--lambda_intensity`).
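A sketch of this combined objective, assuming MSE as the regression criterion (the repo's exact loss may differ):

```python
import torch.nn.functional as F

def multitask_loss(ctc_loss, intensity_pred, intensity_target, lambda_intensity=1.0):
    """Combine the CTC loss from the ASR head with the intensity regression loss."""
    reg_loss = F.mse_loss(intensity_pred, intensity_target)  # assumed criterion
    return ctc_loss + lambda_intensity * reg_loss
```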
Model Examination
Inspect encoder representations/saliency to see which frames contribute most to intensity prediction.
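A self-contained toy sketch of gradient-based saliency over the input waveform; the encoder here is a stand-in, not the trained model:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model's intensity path.
encoder = nn.Sequential(nn.Conv1d(1, 8, kernel_size=11, stride=5),
                        nn.ReLU(), nn.AdaptiveAvgPool1d(1))
head = nn.Linear(8, 1)

waveform = torch.randn(1, 1, 16000, requires_grad=True)  # 1 s at 16 kHz
intensity = head(encoder(waveform).squeeze(-1)).squeeze()
intensity.backward()

saliency = waveform.grad.abs().squeeze()  # gradient magnitude per sample
print(saliency.topk(5).indices)           # most influential time points
```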
Environmental Impact
- Hardware Type: Laptop GPU
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop (16 GB VRAM)
Technical Specifications
Model Architecture and Objective
- Wav2Vec2‑CTC variant: Transformer encoder with CTC head for ASR + attention‑masked mean‑pooled regressor.
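A minimal sketch of such a head; names and mask handling are illustrative, and with Wav2Vec2 the attention mask must first be downsampled to the encoder's frame rate:

```python
import torch
import torch.nn as nn

class MaskedMeanPoolRegressor(nn.Module):
    """Mean-pool encoder states over valid frames, then regress a scalar."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # hidden_states: (B, T, H); attention_mask: (B, T), 1 for valid frames
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.head(pooled).squeeze(-1)  # (B,) intensity predictions
```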
Compute Infrastructure
- Hardware: Laptop with NVIDIA RTX 3080 Ti (16 GB).
- Software: Python, PyTorch, Hugging Face `transformers`/`datasets`, Gradio.
Citation
If you build on this work, please cite the repository.
BibTeX:
```bibtex
@misc{yousefi2025speechmtl,
  title        = {Speech Multitask End-to-End (ASR + Intensity Regression)},
  author       = {Yousefi, Amirhossein},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec}
}
```
APA:
Yousefi, A. (2025). Speech Multitask End‑to‑End (ASR + Intensity Regression) [Computer software]. GitHub. https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
More Information
- Configs: `configs/wav2vec2_base.yaml`
- Deployment: Amazon SageMaker packaging/inference under `sagemaker/`
Model Card Contact
Please open an issue in the GitHub repository.