---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- audio-regression
- multitask-learning
- wav2vec2
- whisper
- gradio
- sagemaker
datasets:
- librispeech_asr
- mozilla-foundation/common_voice_13_0
base_model:
- facebook/wav2vec2-base-960h
license: mit
language: en
---

# Model Card for `amirhossein-yousefi/speech2text-intensity-regression-wav2vec`

**Summary:** End-to-end speech model that jointly performs **automatic speech recognition (ASR)** and **voice intensity regression** from the same input audio. Backbone: **Wav2Vec2‑CTC** with a regression head.

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi
- **Model type:** Multitask speech model (ASR + scalar intensity regression)
- **Architecture:** `facebook/wav2vec2-base-960h` (CTC) + attention‑masked mean‑pooling regressor (sketched in code below)
- **Language(s):** English (depends on chosen dataset/splits)
- **License:** MIT
- **Finetuned from:** `facebook/wav2vec2-base-960h`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
- **Demo:** Gradio script in `app/gradio_app.py`

## Uses

### Direct Use

- Transcribe English speech to text (ASR) and simultaneously estimate **normalized intensity** for the same audio clip.
- Interactive inference via CLI or Gradio.

### Downstream Use

- Domain‑specific fine‑tuning for ASR while keeping the intensity head.
- Use intensity as an auxiliary signal for VAD thresholds, diarization heuristics, or UX analytics.

### Out‑of‑Scope Use

- Safety‑critical applications without human review.
- Treating the intensity output as perceptual loudness or emotion/affect; it is **RMS dBFS‑derived** and sensitive to mic gain and environment.

## Bias, Risks, and Limitations

- **Dataset bias:** Default training on LibriSpeech (read audiobooks) may not reflect conversational or accented speech.
- **Device & environment sensitivity:** Intensity depends on microphone, distance, and preprocessing.
- **Domain shift:** Degradation is expected on far‑field, noisy, or multilingual inputs without adaptation.

### Recommendations

- Calibrate or post‑normalize intensity for your capture setup.
- Report WER and regression errors by domain (mic type, SNR buckets, etc.).
- Keep a human in the loop for sensitive deployments.
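For orientation before getting started, here is a minimal PyTorch sketch of the multitask architecture described above: a `facebook/wav2vec2-base-960h` CTC backbone whose frame embeddings are mean‑pooled under the attention mask and fed to a small regression head. Class and layer names are illustrative and are **not** the repository's actual code.

```python
# Minimal sketch (illustrative, not the repository's implementation):
# Wav2Vec2 CTC backbone + attention-masked mean-pooled intensity regressor.
import torch.nn as nn
from transformers import Wav2Vec2ForCTC


class Wav2Vec2WithIntensity(nn.Module):  # hypothetical class name
    def __init__(self, model_name: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.ctc = Wav2Vec2ForCTC.from_pretrained(model_name)
        hidden = self.ctc.config.hidden_size
        self.intensity_head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, input_values, attention_mask=None):
        encoder_out = self.ctc.wav2vec2(input_values, attention_mask=attention_mask)
        hidden_states = encoder_out.last_hidden_state  # (batch, frames, hidden)

        # CTC logits for ASR, reusing the backbone's dropout and LM head.
        logits = self.ctc.lm_head(self.ctc.dropout(hidden_states))

        # Attention-masked mean pooling over frames for the intensity head.
        if attention_mask is not None:
            # Private transformers helper that maps the sample-level mask to
            # the encoder frame rate; used here only for illustration.
            frame_mask = self.ctc._get_feature_vector_attention_mask(
                hidden_states.shape[1], attention_mask
            ).unsqueeze(-1).to(hidden_states.dtype)
            pooled = (hidden_states * frame_mask).sum(1) / frame_mask.sum(1).clamp(min=1.0)
        else:
            pooled = hidden_states.mean(dim=1)

        intensity = self.intensity_head(pooled).squeeze(-1)  # targets live in [0, 1]
        return logits, intensity
```

The training scripts below produce the actual fine‑tuned checkpoints (e.g. `outputs/wav2vec2_base_mtl`); their exact head layout and class names may differ from this sketch.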
## How to Get Started with the Model

### Environment

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

### Train (Whisper backbone)

```bash
python -m src.speech_mtl.training.train_whisper \
  --model_name openai/whisper-small \
  --language en \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --num_train_epochs 1 \
  --output_dir outputs/whisper_small_mtl
```

### Train (Wav2Vec2‑CTC backbone)

```bash
python -m src.speech_mtl.training.train_wav2vec2 \
  --model_name facebook/wav2vec2-base-960h \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --max_train_samples 1000 \
  --max_eval_samples 150 \
  --num_train_epochs 1 \
  --output_dir outputs/wav2vec2_base_mtl
```

### Evaluate

```bash
python -m src.speech_mtl.eval.evaluate \
  --whisper_model_dir outputs/whisper_small_mtl \
  --wav2vec2_model_dir outputs/wav2vec2_base_mtl \
  --dataset librispeech_asr \
  --split test.clean \
  --text_column text
```

### Inference (CLI)

```bash
python -m src.speech_mtl.inference.predict \
  --model whisper \
  --checkpoint outputs/whisper_small_mtl \
  --audio path/to/audio.wav
```

### Gradio Demo

```bash
python app/gradio_app.py --model whisper --checkpoint outputs/whisper_small_mtl
# or
python app/gradio_app.py --model wav2vec2 --checkpoint outputs/wav2vec2_base_mtl
```

## Training Details

### Training Data

- **Default:** `librispeech_asr` (`train.clean.100`; eval on `validation.clean` / `test.clean`).
- **Optional:** `mozilla-foundation/common_voice_13_0` via `--dataset` and `--language`.

**Intensity targets:** computed from audio RMS dBFS, clipped to `[-60, 0]` and normalized to `[0, 1]`:

```text
norm_intensity = (dbfs + 60) / 60   # dbfs clipped to [-60, 0]  ->  norm_intensity in [0, 1]
```

### Training Procedure

#### Preprocessing

- Load audio and resample to 16 kHz, as required by the backbones.
- Compute intensity labels from the raw audio; LUFS (via `pyloudnorm`) can be used as an alternative.

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision when available; batch size and learning rate configured via `configs/*.yaml`.

#### Speeds, Sizes, Times

- Example single‑epoch fine‑tuned weights are linked in the repo README (`training-logs/` contains logs).

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** LibriSpeech `test.clean` by default; optionally Common Voice.
- **Factors:** noise level, microphone/domain, utterance length.
- **Metrics:**
  - **ASR:** Word Error Rate (WER)
  - **Intensity regression:** MAE, MSE, and R²

### Results

## 📊 Training Logs & Metrics

- **Total FLOPs (training):** `11,971,980,681,992,470,000`
- **Training runtime:** `9,579.8516` seconds for 3 epochs
- **Logging:** TensorBoard‑compatible logs in `src/checkpoint/logs`; you can monitor training live with `tensorboard --logdir src/checkpoint/logs`.

## ✅ Full Metrics

### 🔎 Highlights

- **Validation WER (↓):** **12.897%** _(0.128966 as a fraction)_
- **Validation Loss:** **21.7842**
- Fast eval throughput: **17.05 samples/s** • **4.264 steps/s**

> **WER** from `jiwer.wer` (a fraction in [0, 1]; percent shown for readability).
> This run uses a **CTC** objective for ASR plus an auxiliary **intensity** head (multi‑task), but only ASR metrics were logged during evaluation.
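For reference, the fractional WER above is what the `jiwer` package returns; a tiny self‑contained example (the strings are illustrative, not taken from the evaluation set):

```python
# Illustrative only: jiwer reports WER as a fraction in [0, 1].
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over a lazy dog"

error = wer(reference, hypothesis)   # 2 word errors / 9 reference words ≈ 0.2222
print(f"WER: {error:.4f} ({error * 100:.2f}%)")
```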
#### Validation (Dev)

| Metric | Value |
|---|---|
| **Loss** | **21.7842** |
| **WER (↓)** | **0.128966** _(12.897%)_ |
| **Runtime (s)** | **158.5324** _(≈ 2m 39s)_ |
| **Samples / s** | **17.050** |
| **Steps / s** | **4.264** |
| **Epoch** | **2.8** |

#### Training Summary

| Metric | Value |
|---|---|
| **Train Loss** | **227.4951** |
| **Runtime (s)** | **9,579.8514** _(≈ 2h 39m 40s)_ |
| **Samples / s** | **8.937** |
| **Steps / s** | **0.559** |
| **Epochs** | **3.0** |

---

#### Summary

The multitask objective is the ASR (CTC) loss plus the intensity regression loss, with the weight of the regression term controlled by `--lambda_intensity` (a schematic sketch is given at the end of this card).

## Model Examination

Inspect encoder representations or saliency maps to see which frames contribute most to the intensity prediction.

## Environmental Impact

- **Hardware Type:** Laptop GPU
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop (16 GB VRAM)

## Technical Specifications

### Model Architecture and Objective

- **Wav2Vec2‑CTC variant:** Transformer encoder with a CTC head for ASR plus an attention‑masked mean‑pooled regressor for intensity.

### Compute Infrastructure

- **Hardware:** Laptop with NVIDIA RTX 3080 Ti (16 GB).
- **Software:** Python, PyTorch, Hugging Face `transformers`/`datasets`, Gradio.

## Citation

If you build on this work, please cite the repository.

**BibTeX:**

```bibtex
@misc{yousefi2025speechmtl,
  title        = {Speech Multitask End-to-End (ASR + Intensity Regression)},
  author       = {Yousefi, Amirhossein},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec}
}
```

**APA:**

Yousefi, A. (2025). *Speech Multitask End‑to‑End (ASR + Intensity Regression)* [Computer software]. GitHub. https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec

## More Information

- Configs: `configs/wav2vec2_base.yaml`
- Deployment: Amazon SageMaker packaging/inference under `sagemaker/`

## Model Card Contact

Please open an issue in the GitHub repository.
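As referenced in the Summary above, training combines the CTC loss for ASR with a weighted regression loss for intensity. The snippet below is a minimal sketch only: it assumes an MSE regression term (MSE is one of the reported regression metrics), `lambda_intensity` mirrors the `--lambda_intensity` flag, and the other names are illustrative rather than taken from the repository.

```python
# Schematic multitask objective: CTC loss (ASR) + lambda * MSE (intensity).
# Illustrative only; the repository's loss implementation may differ in detail.
import torch.nn.functional as F


def multitask_loss(log_probs, targets, input_lengths, target_lengths,
                   intensity_pred, intensity_target, lambda_intensity):
    # log_probs: (time, batch, vocab) log-probabilities, as expected by F.ctc_loss
    asr_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)
    intensity_loss = F.mse_loss(intensity_pred, intensity_target)  # targets in [0, 1]
    return asr_loss + lambda_intensity * intensity_loss
```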