Model Card for amirhossein-yousefi/speech2text-intensity-regression-wav2vec

Summary: End-to-end speech model that jointly performs automatic speech recognition (ASR) and voice intensity regression from the same input audio: Wav2Vec2‑CTC with a regression head.

Model Details

Model Description

  • Developed by: Amirhossein Yousefi
  • Model type: Multitask speech model (ASR + scalar intensity regression).
    • facebook/wav2vec2-base-960h (CTC) + attention‑masked mean pooling regressor
  • Language(s): English (depends on chosen dataset/splits)
  • License: MIT
  • Finetuned from: facebook/wav2vec2-base-960h

Model Sources

Uses

Direct Use

  • Transcribe English speech to text (ASR) and simultaneously estimate normalized intensity for the same audio clip.
  • Interactive inference via CLI or Gradio.

Downstream Use

  • Domain‑specific fine‑tuning for ASR while keeping the intensity head.
  • Use intensity as an auxiliary signal for VAD thresholds, diarization heuristics, or UX analytics.

Out‑of‑Scope Use

  • Safety‑critical applications without human review.
  • Treating the intensity output as perceptual loudness or emotion/affect; it is RMS dBFS‑derived and sensitive to mic gain/environment.

Bias, Risks, and Limitations

  • Dataset bias: Default training on LibriSpeech (read audiobooks) may not reflect conversational or accented speech.
  • Device & environment sensitivity: Intensity depends on microphone, distance, and preprocessing.
  • Domain shift: Degradation is expected on far‑field/noisy/multilingual inputs without adaptation.

Recommendations

  • Calibrate or post‑normalize intensity for your capture setup (a small calibration sketch follows this list).
  • Report WER and regression errors by domain (mic type, SNR buckets, etc.). Keep a human in the loop for sensitive deployments.
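
A minimal post‑normalization sketch (an assumed workflow, not code from this repository): fit an affine correction from the model's predicted intensities to reference intensities measured on a few clips recorded with your own setup. The values and names below are placeholders.

import numpy as np

# Hypothetical calibration data: model predictions vs. reference intensities
# (both on the normalized [0, 1] scale) for the same clips.
predicted = np.array([0.42, 0.55, 0.61, 0.70, 0.38])
reference = np.array([0.35, 0.50, 0.58, 0.68, 0.30])

# Least-squares affine fit: reference ≈ a * predicted + b
a, b = np.polyfit(predicted, reference, deg=1)

def calibrate(x: float) -> float:
    """Map a raw model intensity onto the calibrated scale, clipped to [0, 1]."""
    return float(np.clip(a * x + b, 0.0, 1.0))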

How to Get Started with the Model

Environment

python -m venv .venv
source .venv/bin/activate      # Windows: .venv\Scripts\activate
pip install -r requirements.txt

Train (Whisper backbone)

python -m src.speech_mtl.training.train_whisper \
  --model_name openai/whisper-small \
  --language en \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --num_train_epochs 1 \
  --output_dir outputs/whisper_small_mtl

Train (Wav2Vec2‑CTC backbone)

python -m src.speech_mtl.training.train_wav2vec2 \
  --model_name facebook/wav2vec2-base-960h \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --max_train_samples 1000 \
  --max_eval_samples 150 \
  --num_train_epochs 1 \
  --output_dir outputs/wav2vec2_base_mtl

Evaluate

python -m src.speech_mtl.eval.evaluate \
  --whisper_model_dir outputs/whisper_small_mtl \
  --wav2vec2_model_dir outputs/wav2vec2_base_mtl \
  --dataset librispeech_asr \
  --split test.clean \
  --text_column text

Inference (CLI)

python -m src.speech_mtl.inference.predict \
  --model whisper \
  --checkpoint outputs/whisper_small_mtl \
  --audio path/to/audio.wav

Gradio Demo

python app/gradio_app.py --model whisper --checkpoint outputs/whisper_small_mtl
# or
python app/gradio_app.py --model wav2vec2 --checkpoint outputs/wav2vec2_base_mtl

Training Details

Training Data

  • Default: librispeech_asr (train.clean.100; eval on validation.clean / test.clean).
  • Optional: mozilla-foundation/common_voice_13_0 via --dataset and --language.

Intensity targets: computed from audio RMS dBFS bounded to [-60, 0], then normalized to [0, 1]:

norm_intensity = (dbfs + 60) / 60
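
A minimal NumPy sketch of this target (the exact preprocessing lives in the repo's training code; this only reproduces the formula above):

import numpy as np

def intensity_target(waveform: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized intensity in [0, 1] from RMS dBFS bounded to [-60, 0] dB."""
    rms = np.sqrt(np.mean(np.square(waveform)))      # waveform assumed in [-1, 1]
    dbfs = 20.0 * np.log10(max(rms, eps))            # RMS level in dBFS
    dbfs = float(np.clip(dbfs, -60.0, 0.0))          # bound to [-60, 0]
    return (dbfs + 60.0) / 60.0                      # normalize to [0, 1]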

Training Procedure

Preprocessing

  • Load/resample to 16 kHz per backbone requirements.
  • Compute intensity labels from raw audio; LUFS (via pyloudnorm) can be used as an alternative.
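
As an illustration of the LUFS alternative, here is a sketch using pyloudnorm and librosa; the file path and the [-60, 0] normalization bounds are placeholders mirroring the dBFS recipe above, not repo defaults:

import librosa
import pyloudnorm as pyln

# Load and resample to the 16 kHz expected by the Wav2Vec2 backbone.
audio, sr = librosa.load("path/to/audio.wav", sr=16_000, mono=True)

# Integrated loudness in LUFS (ITU-R BS.1770), an alternative to RMS dBFS.
meter = pyln.Meter(sr)
lufs = meter.integrated_loudness(audio)

# Example normalization to [0, 1].
norm_intensity = (min(max(lufs, -60.0), 0.0) + 60.0) / 60.0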

Training Hyperparameters

  • Training regime: fp16 mixed precision when available; batch size and LR configured via configs/*.yaml.

Speeds, Sizes, Times

  • Example single‑epoch fine‑tuned weights are linked in the repo README (training-logs/ contains logs).

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: LibriSpeech test.clean by default; optionally Common Voice.
  • Factors: noise level, microphone/domain, utterance length.
  • Metrics:
    • ASR: Word Error Rate (WER)
    • Intensity regression: MAE, MSE, and R²
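
A sketch of how these metrics can be computed with jiwer and scikit-learn (not the repo's evaluation script; the prediction and reference lists are placeholders):

from jiwer import wer
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

references = ["the cat sat on the mat"]         # ground-truth transcripts
hypotheses = ["the cat sat on a mat"]           # ASR outputs
true_int   = [0.62, 0.48, 0.71]                 # ground-truth intensities
pred_int   = [0.60, 0.52, 0.69]                 # predicted intensities

print("WER:", wer(references, hypotheses))      # fraction in [0, 1]
print("MAE:", mean_absolute_error(true_int, pred_int))
print("MSE:", mean_squared_error(true_int, pred_int))
print("R^2:", r2_score(true_int, pred_int))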

Results

📊 Training Logs & Metrics

  • Total FLOPs (training): 11,971,980,681,992,470,000
  • Training runtime: 9,579.8516 seconds for 3 epochs
  • Logging: TensorBoard-compatible logs in src/checkpoint/logs

You can monitor training live with:
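
tensorboard --logdir src/checkpoint/logs   # assumes the default log directory noted above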

✅ Full Metrics

🔎 Highlights

  • Validation WER (↓): 12.897% (0.128966 as fraction)
  • Validation Loss: 21.7842
  • Fast eval throughput: 17.05 samples/s · 4.264 steps/s

WER from jiwer.wer (fraction in [0,1]; percent shown for readability).
This run uses a CTC objective for ASR and an auxiliary intensity head (multi‑task), but only ASR metrics were logged during evaluation.

Validation (Dev)

  • Loss: 21.7842
  • WER (↓): 0.128966 (12.897%)
  • Runtime: 158.5324 s (≈ 2m 39s)
  • Samples / s: 17.050
  • Steps / s: 4.264
  • Epoch: 2.8

Training Summary

  • Train Loss: 227.4951
  • Runtime: 9,579.8514 s (≈ 2h 39m 40s)
  • Samples / s: 8.937
  • Steps / s: 0.559
  • Epochs: 3.0

Summary

Multitask objective = ASR loss + intensity regression loss (weight controlled by --lambda_intensity).
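
A minimal sketch of this objective (an assumed form using an MSE regression term; the actual weighting and loss choices are those implemented in the training scripts):

import torch.nn.functional as F

def multitask_loss(ctc_loss, intensity_pred, intensity_target, lambda_intensity=1.0):
    """Combine the ASR (CTC) loss with an intensity regression loss."""
    reg_loss = F.mse_loss(intensity_pred, intensity_target)
    return ctc_loss + lambda_intensity * reg_loss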

Model Examination

Inspect encoder representations/saliency to see which frames contribute most to intensity prediction.
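
One simple probe is input-gradient saliency. This is a generic sketch: `model` is assumed to be any callable mapping a waveform tensor to the scalar intensity prediction, which is an assumption about the interface rather than the repository's API.

import torch

def intensity_saliency(model, waveform: torch.Tensor) -> torch.Tensor:
    """Absolute gradient of the predicted intensity w.r.t. each input sample."""
    waveform = waveform.clone().requires_grad_(True)
    intensity = model(waveform.unsqueeze(0)).squeeze()   # assumed scalar output
    intensity.backward()
    return waveform.grad.abs()                           # large values = influential frames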

Environmental Impact

  • Hardware Type: Laptop GPU
  • GPU: NVIDIA GeForce RTX 3080 Ti Laptop (16 GB VRAM)

Technical Specifications

Model Architecture and Objective

  • Wav2Vec2‑CTC variant: Transformer encoder with CTC head for ASR + attention‑masked mean‑pooled regressor.
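
An illustrative sketch of this design (not the repository's exact module; the class name `IntensityRegressor` and the choice of the last hidden layer are assumptions made here for illustration):

import torch.nn as nn
from transformers import Wav2Vec2ForCTC

class IntensityRegressor(nn.Module):
    """Wav2Vec2-CTC backbone plus an attention-masked mean-pooled regression head."""

    def __init__(self, backbone_name: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.backbone = Wav2Vec2ForCTC.from_pretrained(backbone_name)
        self.regressor = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_values, attention_mask=None):
        out = self.backbone(
            input_values,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        hidden_states = out.hidden_states[-1]             # (batch, frames, hidden)
        if attention_mask is not None:
            # Project the sample-level mask down to encoder frames, then mean-pool.
            frame_mask = self.backbone._get_feature_vector_attention_mask(
                hidden_states.shape[1], attention_mask
            ).unsqueeze(-1)
            pooled = (hidden_states * frame_mask).sum(1) / frame_mask.sum(1).clamp(min=1)
        else:
            pooled = hidden_states.mean(dim=1)
        intensity = self.regressor(pooled).squeeze(-1)    # normalized intensity
        return out.logits, intensity                      # CTC logits + intensity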

Compute Infrastructure

  • Hardware: Laptop with NVIDIA RTX 3080 Ti (16 GB).
  • Software: Python, PyTorch, Hugging Face transformers/datasets, Gradio.

Citation

If you build on this work, please cite the repository.

BibTeX:

@misc{yousefi2025speechmtl,
  title        = {Speech Multitask End-to-End (ASR + Intensity Regression)},
  author       = {Yousefi, Amirhossein},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec}
}

APA:
Yousefi, A. (2025). Speech Multitask End‑to‑End (ASR + Intensity Regression) [Computer software]. GitHub. https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec

More Information

  • Configs: configs/wav2vec2_base.yaml
  • Deployment: Amazon SageMaker packaging/inference under sagemaker/

Model Card Contact

Please open an issue in the GitHub repository.
