Model Card for amirhossein-yousefi/speech2text-intensity-regression-wav2vec
Summary: End-to-end speech model that jointly performs automatic speech recognition (ASR) and voice intensity regression from the same input audio: Wav2Vec2‑CTC with a regression head.
Model Details
Model Description
- Developed by: Amirhossein Yousefi
- Model type: Multitask speech model (ASR + scalar intensity regression): facebook/wav2vec2-base-960h (CTC) + attention‑masked mean‑pooling regressor
- Language(s): English (depends on chosen dataset/splits)
- License: MIT
- Finetuned from: facebook/wav2vec2-base-960h
Model Sources
- Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
- Demo: Gradio script in `app/gradio_app.py`
Uses
Direct Use
- Transcribe English speech to text (ASR) and simultaneously estimate normalized intensity for the same audio clip (see the sketch after this list).
- Interactive inference via CLI or Gradio.
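The repo's CLI and Gradio app wrap all of this up. For the ASR half alone, a minimal programmatic sketch using standard `transformers` APIs might look as follows; the intensity head is repo-specific, so this only illustrates transcription, and the file path is hypothetical:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load a clip as 16 kHz mono float audio (hypothetical path).
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits            # (1, frames, vocab)
transcript = processor.batch_decode(logits.argmax(dim=-1))[0]
print(transcript)
```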
Downstream Use
- Domain‑specific fine‑tuning for ASR while keeping the intensity head.
- Use intensity as an auxiliary signal for VAD thresholds, diarization heuristics, or UX analytics.
Out‑of‑Scope Use
- Safety‑critical applications without human review.
- Treating the intensity output as perceptual loudness or emotion/affect; it is RMS dBFS‑derived and sensitive to mic gain/environment.
Bias, Risks, and Limitations
- Dataset bias: Default training on LibriSpeech (read audiobooks) may not reflect conversational or accented speech.
- Device & environment sensitivity: Intensity depends on microphone, distance, and preprocessing.
- Domain shift: Degradation is expected on far‑field/noisy/multilingual inputs without adaptation.
Recommendations
- Calibrate or post‑normalize intensity for your capture setup (see the sketch after this list).
- Report WER and regression errors by domain (mic type, SNR buckets, etc.). Keep a human in the loop for sensitive deployments.
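One way to calibrate, assuming you can record a few clips on your device alongside a trusted reference meter (the paired values below are made up), is a least-squares linear correction:

```python
import numpy as np

# Hypothetical paired measurements on your capture setup.
raw = np.array([0.35, 0.48, 0.61, 0.75])  # model intensity outputs
ref = np.array([0.30, 0.45, 0.62, 0.80])  # reference readings for the same clips

gain, offset = np.polyfit(raw, ref, deg=1)  # least-squares linear fit

def calibrate(intensity: float) -> float:
    """Map a raw model intensity to the calibrated scale, clipped to [0, 1]."""
    return float(np.clip(gain * intensity + offset, 0.0, 1.0))
```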
How to Get Started with the Model
Environment
```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```
Train (Whisper backbone)
```bash
python -m src.speech_mtl.training.train_whisper \
  --model_name openai/whisper-small \
  --language en \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --num_train_epochs 1 \
  --output_dir outputs/whisper_small_mtl
```
Train (Wav2Vec2‑CTC backbone)
```bash
python -m src.speech_mtl.training.train_wav2vec2 \
  --model_name facebook/wav2vec2-base-960h \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --max_train_samples 1000 \
  --max_eval_samples 150 \
  --num_train_epochs 1 \
  --output_dir outputs/wav2vec2_base_mtl
```
Evaluate
```bash
python -m src.speech_mtl.eval.evaluate \
  --whisper_model_dir outputs/whisper_small_mtl \
  --wav2vec2_model_dir outputs/wav2vec2_base_mtl \
  --dataset librispeech_asr \
  --split test.clean \
  --text_column text
```
Inference (CLI)
```bash
python -m src.speech_mtl.inference.predict \
  --model whisper \
  --checkpoint outputs/whisper_small_mtl \
  --audio path/to/audio.wav
```
Gradio Demo
```bash
python app/gradio_app.py --model whisper --checkpoint outputs/whisper_small_mtl
# or
python app/gradio_app.py --model wav2vec2 --checkpoint outputs/wav2vec2_base_mtl
```
Training Details
Training Data
- Default: `librispeech_asr` (`train.clean.100`; eval on `validation.clean`/`test.clean`).
- Optional: `mozilla-foundation/common_voice_13_0` via `--dataset` and `--language`.
- Intensity targets: computed from audio RMS dBFS, bounded to [-60, 0] dBFS, then normalized to [0, 1] as `norm_intensity = (dbfs + 60) / 60` (sketched below).
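A minimal NumPy sketch of this label computation (the repo's exact implementation may differ in details such as the epsilon):

```python
import numpy as np

def intensity_target(waveform: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized intensity label from a mono float waveform in [-1, 1]."""
    rms = np.sqrt(np.mean(waveform ** 2))    # root-mean-square amplitude
    dbfs = 20.0 * np.log10(rms + eps)        # RMS level in dBFS
    dbfs = float(np.clip(dbfs, -60.0, 0.0))  # bound to [-60, 0]
    return (dbfs + 60.0) / 60.0              # normalize to [0, 1]
```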
Training Procedure
Preprocessing
- Load/resample to 16 kHz per backbone requirements.
- Compute intensity labels from raw audio; LUFS (via `pyloudnorm`) can be used as an alternative (see the sketch after this list).
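For illustration, resampling with `librosa` and the LUFS alternative via `pyloudnorm` might look like this; the file path is hypothetical, and mapping LUFS to a normalized target is left to the user:

```python
import librosa
import pyloudnorm as pyln

# Load and resample to the 16 kHz rate the Wav2Vec2 backbone expects.
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)

# Alternative label: integrated loudness (LUFS) per ITU-R BS.1770.
meter = pyln.Meter(sr)
lufs = meter.integrated_loudness(audio)  # needs clips >= 0.4 s
```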
Training Hyperparameters
- Training regime: fp16 mixed precision when available; batch size and LR configured via `configs/*.yaml`.
Speeds, Sizes, Times
- Example single‑epoch fine‑tuned weights are linked in the repo README (`training-logs/` contains logs).
Evaluation
Testing Data, Factors & Metrics
- Testing Data: LibriSpeech `test.clean` by default; optionally Common Voice.
- Factors: noise level, microphone/domain, utterance length.
- Metrics:
  - ASR: Word Error Rate (WER)
  - Intensity regression: MAE, MSE, and R² (see the sketch after this list)
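A sketch of how these metrics can be computed with `jiwer` and `scikit-learn` (toy values for illustration):

```python
import jiwer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

refs = ["the cat sat on the mat", "hello world"]  # reference transcripts
hyps = ["the cat sat on a mat", "hello world"]    # model transcripts
wer = jiwer.wer(refs, hyps)                       # fraction in [0, 1]

y_true = [0.62, 0.41, 0.88, 0.15]                 # intensity labels
y_pred = [0.60, 0.45, 0.80, 0.20]                 # model predictions
mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(f"WER={wer:.3f} MAE={mae:.3f} MSE={mse:.4f} R2={r2:.3f}")
```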
Results
📊 Training Logs & Metrics
- Total FLOPs (training): 11,971,980,681,992,470,000
- Training runtime: 9,579.8516 seconds for 3 epochs
- Logging: TensorBoard‑compatible logs in `src/checkpoint/logs`; you can monitor training live (see below).
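For example, with TensorBoard installed:

```bash
tensorboard --logdir src/checkpoint/logs
```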
✅ Full Metrics
🔎 Highlights
- Validation WER (↓): 12.897% (0.128966 as a fraction)
- Validation Loss: 21.7842
- Fast eval throughput: 17.05 samples/s • 4.264 steps/s
WER from `jiwer.wer` (fraction in [0, 1]; percent shown for readability).
This run uses a CTC objective for ASR and an auxiliary intensity head (multi‑task), but only ASR metrics were logged during evaluation.
Validation (Dev)
| Metric | Value |
|---|---|
| Loss | 21.7842 |
| WER (↓) | 0.128966 (12.897%) |
| Runtime (s) | 158.5324 (≈ 2m 39s) |
| Samples / s | 17.050 |
| Steps / s | 4.264 |
| Epoch | 2.8 |
Training Summary
| Metric | Value |
|---|---|
| Train Loss | 227.4951 |
| Runtime (s) | 9,579.8514 (≈ 2h 39m 40s) |
| Samples / s | 8.937 |
| Steps / s | 0.559 |
| Epochs | 3.0 |
Summary
Multitask objective = ASR loss + intensity regression loss (weight controlled by `--lambda_intensity`).
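A sketch of this combined objective, assuming MSE as the regression criterion (the repo's exact loss may differ):

```python
import torch.nn.functional as F

def multitask_loss(ctc_loss, intensity_pred, intensity_target, lambda_intensity=1.0):
    """Combine the CTC loss from the ASR head with the intensity regression loss."""
    reg_loss = F.mse_loss(intensity_pred, intensity_target)  # assumed criterion
    return ctc_loss + lambda_intensity * reg_loss
```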
Model Examination
Inspect encoder representations/saliency to see which frames contribute most to intensity prediction.
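A self-contained toy sketch of gradient-based saliency over the input waveform; the encoder here is a stand-in, not the trained model:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model's intensity path.
encoder = nn.Sequential(nn.Conv1d(1, 8, kernel_size=11, stride=5),
                        nn.ReLU(), nn.AdaptiveAvgPool1d(1))
head = nn.Linear(8, 1)

waveform = torch.randn(1, 1, 16000, requires_grad=True)  # 1 s at 16 kHz
intensity = head(encoder(waveform).squeeze(-1)).squeeze()
intensity.backward()

saliency = waveform.grad.abs().squeeze()  # gradient magnitude per sample
print(saliency.topk(5).indices)           # most influential time points
```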
Environmental Impact
- Hardware Type: Laptop GPU
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop (16 GB VRAM)
Technical Specifications
Model Architecture and Objective
- Wav2Vec2‑CTC variant: Transformer encoder with CTC head for ASR + attention‑masked mean‑pooled regressor.
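A minimal sketch of such a head; names and mask handling are illustrative, and with Wav2Vec2 the attention mask must first be downsampled to the encoder's frame rate:

```python
import torch
import torch.nn as nn

class MaskedMeanPoolRegressor(nn.Module):
    """Mean-pool encoder states over valid frames, then regress a scalar."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor):
        # hidden_states: (B, T, H); attention_mask: (B, T), 1 for valid frames
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        return self.head(pooled).squeeze(-1)  # (B,) intensity predictions
```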
Compute Infrastructure
- Hardware: Laptop with NVIDIA RTX 3080 Ti (16 GB).
- Software: Python, PyTorch, Hugging Face `transformers`/`datasets`, Gradio.
Citation
If you build on this work, please cite the repository.
BibTeX:
```bibtex
@misc{yousefi2025speechmtl,
  title        = {Speech Multitask End-to-End (ASR + Intensity Regression)},
  author       = {Yousefi, Amirhossein},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec}
}
```
APA:
Yousefi, A. (2025). Speech Multitask End‑to‑End (ASR + Intensity Regression) [Computer software]. GitHub. https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
More Information
- Configs: `configs/wav2vec2_base.yaml`
- Deployment: Amazon SageMaker packaging/inference under `sagemaker/`
Model Card Contact
Please open an issue in the GitHub repository.