Model Card for `speech2text-intensity-regression`

This repository provides a multi-task Whisper-based model that performs automatic speech recognition (ASR) and voice intensity (loudness) regression in a single forward pass. A lightweight regression head is attached to Whisper’s encoder to predict loudness in dBFS (RMS) or LUFS (per ITU-R BS.1770).

Model Details

Model Description

Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
Shared by: Amirhossein Yousefi
Model type: Whisper encoder–decoder (ASR) with an additional regression head on the encoder for loudness prediction
Language(s) (NLP): English by default (LibriSpeech). Other languages are supported when the model is trained on Common Voice with the appropriate --language code.
License: MIT
Finetuned from model: openai/whisper-small (other Whisper sizes can be used via the --model_id argument).

What’s in the repo

End-to-end training and evaluation scripts (WER + intensity RMSE)
A simple baseline intensity regressor for comparison
A Gradio demo app for local inference
Dockerfile and Amazon SageMaker training/inference helpers

Model Sources

Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression
Demo: Local Gradio app (app/app.py)
Sample checkpoint: See link in repository README

Uses

Direct Use

Transcribe short-form or long-form speech while simultaneously estimating voice loudness (RMS dBFS or LUFS) for analytics, QA, or normalization workflows.
Monitor audio level trends alongside transcript quality in call analytics, content moderation pipelines, or dataset curation.

Downstream Use

Fine-tune on domain- or language-specific data (e.g., Common Voice) to adapt both transcription and loudness estimation.
Integrate the model’s loudness head into larger prosody or audio-quality monitoring systems.

Out-of-Scope Use

Emotion/affect inference: Loudness is not a proxy for emotional intensity or arousal without appropriate labels and calibration.
Legal/compliance metering: LUFS/dBFS estimates depend on microphone gain, distance, codec, and environment; do not use as a calibrated sound level meter.
Speaker health/medical conclusions: Not designed or validated for clinical use.

Bias, Risks, and Limitations

ASR robustness can degrade for accents, noisy conditions, reverberant rooms, or domains far from training data.
Loudness predictions are input-chain dependent (mic gain, compression, codecs) and may not be comparable across devices without conditioning or calibration.
LUFS vs dBFS: LUFS better correlates with perceived loudness but depends on implementation details; dBFS (RMS) is simpler but less perceptual.

Recommendations

Calibrate and/or condition on known recording chains when comparing intensity across sessions or devices.
Prefer LUFS targets (--intensity_method lufs) for perceptual alignment; use RMS dBFS for simpler, robust estimates.
Evaluate on in-domain audio (compute WER and intensity RMSE) before deployment; consider domain adaptation via fine-tuning.

How to Get Started with the Model

Install (Python 3.10+; ensure a matching PyTorch+CUDA wheel if using GPU):

git clone https://github.com/amirhossein-yousefi/speech2text-intensity-regression
cd speech2text-intensity-regression
pip install -r requirements.txt

Train (example: LibriSpeech clean-100, Whisper-small):

python src/train_multitask_whisper.py \
  --model_id openai/whisper-small \
  --dataset librispeech --librispeech_config clean \
  --train_split train.100 --eval_split validation --test_split test \
  --language en --intensity_method rms \
  --epochs 3 --batch_size 8 --grad_accum 2 --lr 1e-5 --fp16 \
  --output_dir ./checkpoints/mtl_whisper_small

Evaluate on test:

python src/evaluate.py \
  --ckpt ./checkpoints/mtl_whisper_small \
  --dataset librispeech --language en --intensity_method rms

Run the local demo app:

CHECKPOINT=./checkpoints/mtl_whisper_small python app/app.py
# Open the printed Gradio URL; upload a .wav/.flac to see transcript + intensity

CLI baseline intensity regressor:

python src/baseline/baseline_intensity_regressor.py \
  --dataset librispeech --language en --intensity rms

Training Details

Training Data

LibriSpeech via 🤗 Datasets: openslr/librispeech_asr (use clean config; train.100, validation, test splits). Intensity targets are computed directly from audio (RMS dBFS or LUFS).
Common Voice 11.0 via 🤗 Datasets: mozilla-foundation/common_voice_11_0 (set --language, e.g., en, hi).

Note: For human-annotated arousal/intensity, you may adapt the code to datasets like MSP-Podcast or CREMA-D (ensure licensing).

Training Procedure

Preprocessing

Audio is resampled to 16 kHz, as required by the Whisper feature extractor.
Intensity is computed per clip as RMS (dBFS) or LUFS (via pyloudnorm).
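
As a rough illustration of the RMS target, the sketch below computes a per-clip level in dBFS using only the standard library. It assumes floating-point samples with full scale at 1.0 and a small floor (the `floor` parameter is hypothetical) so that digital silence maps to a finite value. The repository's own implementation may differ, and LUFS targets additionally involve the BS.1770 weighting and gating that pyloudnorm provides.

```python
import math
from typing import Sequence


def rms_dbfs(samples: Sequence[float], floor: float = 1e-10) -> float:
    """Return the RMS level of a clip in dBFS (full scale assumed at 1.0)."""
    if len(samples) == 0:
        raise ValueError("samples must not be empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    rms = math.sqrt(mean_square)
    # Clamp to a small floor so that silence does not produce log10(0).
    return 20.0 * math.log10(max(rms, floor))


# One second of a full-scale 440 Hz sine at 16 kHz.
# Its RMS is 1/sqrt(2), which is roughly -3.01 dBFS.
sine = [math.sin(2.0 * math.pi * 440.0 * n / 16000.0) for n in range(16000)]
level = rms_dbfs(sine)
```

Because the value depends directly on sample amplitude, any gain applied before digitization shifts the target by the same number of decibels.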

Objective

A small MLP regression head is attached to the mean-pooled encoder last hidden state. Training minimizes:

total_loss = asr_ce_loss + λ * mse(intensity)

λ is controlled by --lambda_intensity (default 1.0).
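
Written out more explicitly, and assuming the head operates on a plain (unmasked) mean over encoder time steps, the objective could be expressed as below. The symbols are notation introduced here for illustration and are not taken from the repository: h_{i,t} is the encoder's last hidden state for clip i at frame t, T is the number of encoder frames, g_φ is the MLP head, y_i is the loudness target, and B is the batch size.

```latex
\begin{aligned}
\bar{h}_i &= \frac{1}{T}\sum_{t=1}^{T} h_{i,t}, \qquad \hat{y}_i = g_\phi\!\left(\bar{h}_i\right), \\
\mathcal{L}_{\text{total}} &= \mathcal{L}_{\text{CE}} + \lambda \cdot \frac{1}{B}\sum_{i=1}^{B}\left(\hat{y}_i - y_i\right)^2 .
\end{aligned}
```

Since the targets are in decibel-like units, the squared-error term can be considerably larger than the cross-entropy term early in training; λ is the knob for rebalancing the two.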

Training Hyperparameters

Example: epochs=3, batch_size=8, grad_accum=2 (an effective batch size of 16 on a single GPU), lr=1e-5, fp16=True (see README for more).

Speeds, Sizes, Times

Base ASR backbone (example): openai/whisper-small (~244M parameters). Training time depends on hardware and dataset size.

Evaluation

Testing Data, Factors & Metrics

Testing Data: LibriSpeech test split or your in-domain test set
Factors: Noise conditions, microphones, languages, codecs
Metrics:
- ASR: WER (via jiwer)
- Intensity: RMSE in dB (for dBFS targets) or LU (for LUFS targets)
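
To make the two metrics concrete, here is a minimal standard-library sketch of what they measure. The repository computes WER with jiwer; the `word_error_rate` function below is an illustrative reimplementation (word-level edit distance divided by reference length, with no text normalization), not jiwer's code. Both function names are hypothetical.

```python
import math
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rmse(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Root-mean-square error between predicted and target loudness values."""
    if len(predictions) != len(targets) or len(targets) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
# Errors of 1 dB and 3 dB.
err = rmse([-20.0, -30.0], [-21.0, -27.0])
```

`word_error_rate` returns a fraction; multiply by 100 for a percentage.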

📊 Results & Metrics

🔎 Highlights

Test WER (↓): 4.6976
Test Intensity RMSE (↓): 0.7334
Validation WER (↓): 4.6973
Validation Intensity RMSE (↓): 1.4492

Lower is better (↓). WER is computed with jiwer. Intensity RMSE is the regression error on the loudness target (RMS dBFS by default, or LUFS if --intensity_method lufs is used).

✅ Full Metrics

Validation (Dev)

Metric	Value
Loss	2.2288
WER (↓)	4.6973
Intensity RMSE (↓)	1.4492
Runtime (s)	1,156.757 (≈ 19m 17s)
Samples / s	2.337
Steps / s	0.292
Epoch	1

Test

Metric	Value
Loss	0.6631
WER (↓)	4.6976
Intensity RMSE (↓)	0.7334
Runtime (s)	1,129.272 (≈ 18m 49s)
Samples / s	2.320
Steps / s	0.290
Epoch	1

Training Summary

Metric	Value
Train Loss	72.5232
Runtime (s)	6,115.966 (≈ 1h 41m 56s)
Samples / s	4.666
Steps / s	0.292
Epochs	1

Raw metrics (for reproducibility)

{
  "validation": {
    "eval_loss": 2.228771209716797,
    "eval_wer": 4.69732730414323,
    "eval_intensity_rmse": 1.4492216110229492,
    "eval_runtime": 1156.7567,
    "eval_samples_per_second": 2.337,
    "eval_steps_per_second": 0.292,
    "epoch": 1.0
  },
  "training": {
    "train_loss": 72.52319664163974,
    "train_runtime": 6115.9656,
    "train_samples_per_second": 4.666,
    "train_steps_per_second": 0.292,
    "epoch": 1.0
  },
  "test": {
    "test_loss": 0.6630592346191406,
    "test_wer": 4.69758064516129,
    "test_intensity_rmse": 0.7333692312240601,
    "test_runtime": 1129.2724,
    "test_samples_per_second": 2.32,
    "test_steps_per_second": 0.29,
    "epoch": 1.0
  }
}

Results

Example logs and a sample checkpoint are referenced in the repository (training-test-logs/ and the README link). Reproduce numbers with the provided scripts for your environment.

Model Examination

Inspect encoder activations or the regression head behavior across amplitude-normalized vs. unnormalized inputs to understand sensitivity to recording chain variations.
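
One simple way to probe this sensitivity is a gain sweep: scale the same clip by known amounts and compare how far the prediction moves. The sketch below is an assumption-laden illustration, not part of the repository. `predict` is any user-supplied callable that maps samples to a loudness value, and `reference_rms_dbfs` is a stand-in used only so the example runs without a trained checkpoint.

```python
import math
from typing import Callable, List, Sequence, Tuple


def apply_gain_db(samples: Sequence[float], gain_db: float) -> List[float]:
    """Scale samples by a gain given in decibels (no clipping is applied)."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


def gain_sensitivity(
    predict: Callable[[List[float]], float],
    samples: Sequence[float],
    gains_db: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (gain_db, prediction shift relative to the unscaled clip) pairs."""
    baseline = predict(list(samples))
    return [(g, predict(apply_gain_db(samples, g)) - baseline) for g in gains_db]


def reference_rms_dbfs(samples: List[float]) -> float:
    """Stand-in predictor: exact RMS level in dBFS (full scale assumed at 1.0)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-10))


# A quiet 220 Hz tone, swept through three gain settings.
clip = [0.1 * math.sin(2.0 * math.pi * 220.0 * n / 16000.0) for n in range(16000)]
shifts = gain_sensitivity(reference_rms_dbfs, clip, [-12.0, -6.0, 6.0])
```

With the stand-in, each shift equals the applied gain. Swapping in a wrapper around the trained model (how to build one depends on how the checkpoint is loaded) would show how closely its predictions track a pure gain change.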

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

Hardware Type: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
Hours used: Not reported in total (varies by user setup and dataset size); the one-epoch run in the Training Summary logged about 1.7 hours of training time.
Cloud Provider: N/A for local training; AWS SageMaker supported for cloud

Technical Specifications

Model Architecture and Objective

Whisper encoder–decoder (transformer) for ASR with an additional regression head on top of the mean-pooled encoder representation. Objective is ASR CE loss + λ·MSE for intensity.

Compute Infrastructure

Hardware

Validated on a single laptop GPU (RTX 3080 Ti Laptop). SageMaker training scripts included for cloud training.

Software

Python, PyTorch, 🤗 Transformers/Datasets, jiwer, pyloudnorm, Gradio, (optional) Amazon SageMaker.

Citation

If you use this repository, please consider citing the underlying datasets and Whisper model.

BibTeX (Whisper):

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

BibTeX (LibriSpeech):

@inproceedings{panayotov2015librispeech,
  title={Librispeech: An {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}

BibTeX (Common Voice):

@inproceedings{ardila2020common,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M. and Weber, Gregor},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4218--4222},
  year={2020}
}

Glossary

ASR: Automatic Speech Recognition
WER: Word Error Rate
dBFS: Decibels relative to full scale (digital amplitude)
LUFS: Loudness Units relative to Full Scale (per ITU-R BS.1770)
Regression head: Small MLP predicting continuous loudness target

More Information

For deployment, see sagemaker/inference/ and sagemaker/train/ for AWS SageMaker examples.
For local testing and UI, see app/app.py (Gradio).

Model Card Authors

Amirhossein Yousefi and contributors

Model Card Contact

GitHub Issues on the repository
