Model Card for speech2text-intensity-regression
This repository provides a multi-task Whisper-based model that performs automatic speech recognition (ASR) and voice intensity (loudness) regression in a single forward pass. A lightweight regression head is attached to Whisper’s encoder to predict loudness in dBFS (RMS) or LUFS (per ITU-R BS.1770).
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
- Shared by: Amirhossein Yousefi
- Model type: Whisper encoder–decoder (ASR) with an additional regression head on the encoder for loudness prediction
- Language(s) (NLP): English by default (LibriSpeech). Multilingual use is supported when training on Common Voice with the appropriate `--language` code.
- License: MIT
- Finetuned from model: `openai/whisper-small` (other Whisper sizes can be used via the `--model_id` argument).
What’s in the repo
- End-to-end training and evaluation scripts (WER + intensity RMSE)
- A simple baseline intensity regressor for comparison
- A Gradio demo app for local inference
- Dockerfile and Amazon SageMaker training/inference helpers
Model Sources
- Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression
- Demo: Local Gradio app (`app/app.py`)
- Sample checkpoint: See link in repository README
Uses
Direct Use
- Transcribe short-form or long-form speech while simultaneously estimating voice loudness (RMS (dBFS) or LUFS) for analytics, QA, or normalization workflows.
- Monitor audio level trends alongside transcript quality in call analytics, content moderation pipelines, or dataset curation.
Downstream Use
- Fine-tune on domain- or language-specific data (e.g., Common Voice) to adapt both transcription and loudness estimation.
- Integrate the model’s loudness head into larger prosody or audio-quality monitoring systems.
Out-of-Scope Use
- Emotion/affect inference: Loudness is not a proxy for emotional intensity or arousal without appropriate labels and calibration.
- Legal/compliance metering: LUFS/dBFS estimates depend on microphone gain, distance, codec, and environment; do not use as a calibrated sound level meter.
- Speaker health/medical conclusions: Not designed or validated for clinical use.
Bias, Risks, and Limitations
- ASR robustness can degrade for accents, noisy conditions, reverberant rooms, or domains far from training data.
- Loudness predictions are input-chain dependent (mic gain, compression, codecs) and may not be comparable across devices without conditioning or calibration.
- LUFS vs dBFS: LUFS better correlates with perceived loudness but depends on implementation details; dBFS (RMS) is simpler but less perceptual.
Recommendations
- Calibrate and/or condition on known recording chains when comparing intensity across sessions or devices (a minimal calibration sketch follows this list).
- Prefer LUFS targets (`--intensity_method lufs`) for perceptual alignment; use RMS dBFS for simpler, robust estimates.
- Evaluate on in-domain audio (compute WER and intensity RMSE) before deployment; consider domain adaptation via fine-tuning.
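For the calibration recommendation, here is a minimal sketch (not part of the repository; the reference level and function names are assumptions): derive a per-device offset from a recording of a known-level reference signal and add it to that device's estimates before cross-device comparison.

```python
import numpy as np

def device_offset_db(measured_reference_db: float, known_reference_db: float) -> float:
    """Offset that maps this device's measurements onto the known reference level."""
    return known_reference_db - measured_reference_db

def calibrate(intensity_db: np.ndarray, offset_db: float) -> np.ndarray:
    """Apply a per-device calibration offset to predicted loudness values (dBFS or LUFS)."""
    return np.asarray(intensity_db) + offset_db

# Example: a device reads -26.0 dB on a reference known to be -23.0, so add +3.0 dB
# to that device's estimates before comparing loudness across devices.
offset = device_offset_db(measured_reference_db=-26.0, known_reference_db=-23.0)
```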
How to Get Started with the Model
Install (Python 3.10+; ensure a matching PyTorch+CUDA wheel if using GPU):
git clone https://github.com/amirhossein-yousefi/speech2text-intensity-regression
cd speech2text-intensity-regression
pip install -r requirements.txt
Train (example: LibriSpeech clean-100, Whisper-small):
python src/train_multitask_whisper.py --model_id openai/whisper-small --dataset librispeech --librispeech_config clean --train_split train.100 --eval_split validation --test_split test --language en --intensity_method rms --epochs 3 --batch_size 8 --grad_accum 2 --lr 1e-5 --fp16 --output_dir ./checkpoints/mtl_whisper_small
Evaluate on test:
python src/evaluate.py --ckpt ./checkpoints/mtl_whisper_small --dataset librispeech --language en --intensity_method rms
Run the local demo app:
CHECKPOINT=./checkpoints/mtl_whisper_small python app/app.py
# Open the printed Gradio URL; upload a .wav/.flac to see transcript + intensity
CLI baseline intensity regressor:
python src/baseline/baseline_intensity_regressor.py --dataset librispeech --language en --intensity rms
Training Details
Training Data
- LibriSpeech via 🤗 Datasets: `openslr/librispeech_asr` (use the `clean` config with the `train.100`, `validation`, and `test` splits). Intensity targets are computed directly from the audio (RMS dBFS or LUFS).
- Common Voice 11.0 via 🤗 Datasets: `mozilla-foundation/common_voice_11_0` (set `--language`, e.g., `en`, `hi`).
Note: For human-annotated arousal/intensity, you may adapt the code to datasets like MSP-Podcast or CREMA-D (ensure licensing).
Training Procedure
Preprocessing
- Audio resampled to 16 kHz as required by Whisper feature extractor.
- Intensity computed per clip as RMS (dBFS) or LUFS (via `pyloudnorm`).
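A rough illustration of the target computation (not the repository's exact code), assuming mono float audio in [-1, 1] at 16 kHz and that `pyloudnorm` is installed:

```python
import numpy as np
import pyloudnorm as pyln

def rms_dbfs(audio: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level in dBFS for float audio in [-1, 1]."""
    rms = np.sqrt(np.mean(audio ** 2))
    return 20.0 * np.log10(rms + eps)

def lufs(audio: np.ndarray, sample_rate: int = 16_000) -> float:
    """Integrated loudness (LUFS) per ITU-R BS.1770, computed with pyloudnorm."""
    meter = pyln.Meter(sample_rate)  # BS.1770 loudness meter
    return meter.integrated_loudness(audio)
```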
Objective
A small MLP regression head is attached to the mean-pooled encoder last hidden state. Training minimizes:
total_loss = asr_ce_loss + λ * mse(intensity)
where `λ` is controlled by `--lambda_intensity` (default `1.0`).
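The following is an illustrative sketch of this objective, not the repository's exact implementation; the class name, head width, and argument names are assumptions:

```python
import torch.nn as nn
from transformers import WhisperForConditionalGeneration

class MultiTaskWhisper(nn.Module):
    """Sketch: Whisper ASR with a small MLP loudness head on the mean-pooled encoder state."""

    def __init__(self, model_id: str = "openai/whisper-small", lambda_intensity: float = 1.0):
        super().__init__()
        self.whisper = WhisperForConditionalGeneration.from_pretrained(model_id)
        hidden = self.whisper.config.d_model
        self.intensity_head = nn.Sequential(nn.Linear(hidden, 256), nn.GELU(), nn.Linear(256, 1))
        self.lambda_intensity = lambda_intensity

    def forward(self, input_features, labels, intensity_targets):
        out = self.whisper(input_features=input_features, labels=labels)
        # Mean-pool the encoder's last hidden state over time frames, then regress loudness.
        pooled = out.encoder_last_hidden_state.mean(dim=1)
        intensity_pred = self.intensity_head(pooled).squeeze(-1)
        mse = nn.functional.mse_loss(intensity_pred, intensity_targets)
        total_loss = out.loss + self.lambda_intensity * mse  # asr_ce_loss + λ * mse(intensity)
        return total_loss, out.logits, intensity_pred
```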
Training Hyperparameters
- Example: `epochs=3`, `batch_size=8`, `grad_accum=2`, `lr=1e-5`, `fp16=True` (see the README for more).
Speeds, Sizes, Times
- Base ASR backbone (example): `openai/whisper-small` (~244M parameters). Training time depends on hardware and dataset size.
Evaluation
Testing Data, Factors & Metrics
- Testing Data: LibriSpeech test split or your in-domain test set
- Factors: Noise conditions, microphones, languages, codecs
- Metrics:
  - ASR: WER (via `jiwer`)
  - Intensity: RMSE in dBFS or LUFS
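As a minimal sketch (assuming lists of reference/hypothesis transcripts and of true/predicted loudness values), both metrics can be computed with `jiwer` and NumPy:

```python
import numpy as np
from jiwer import wer

def evaluation_metrics(references, hypotheses, intensity_true, intensity_pred):
    """WER (via jiwer, returned as a fraction) and intensity RMSE in dBFS or LUFS."""
    word_error_rate = wer(references, hypotheses)  # multiply by 100 to report as a percentage
    errors = np.asarray(intensity_true) - np.asarray(intensity_pred)
    rmse = float(np.sqrt(np.mean(errors ** 2)))
    return {"wer": word_error_rate, "intensity_rmse": rmse}
```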
📊 Results & Metrics
🔎 Highlights
- Test WER (↓): 4.6976
- Test Intensity RMSE (↓): 0.7334
- Validation WER (↓): 4.6973
- Validation Intensity RMSE (↓): 1.4492
Lower is better (↓). WER is computed with `jiwer`. Intensity RMSE is the regression error on the loudness target (RMS dBFS by default, or LUFS if `--intensity_method lufs` is used).
✅ Full Metrics
Validation (Dev)
Metric | Value |
---|---|
Loss | 2.2288 |
WER (↓) | 4.6973 |
Intensity RMSE (↓) | 1.4492 |
Runtime (s) | 1,156.757 (≈ 19m 17s) |
Samples / s | 2.337 |
Steps / s | 0.292 |
Epoch | 1 |
Test
Metric | Value |
---|---|
Loss | 0.6631 |
WER (↓) | 4.6976 |
Intensity RMSE (↓) | 0.7334 |
Runtime (s) | 1,129.272 (≈ 18m 49s) |
Samples / s | 2.320 |
Steps / s | 0.290 |
Epoch | 1 |
Training Summary
Metric | Value |
---|---|
Train Loss | 72.5232 |
Runtime (s) | 6,115.966 (≈ 1h 41m 56s) |
Samples / s | 4.666 |
Steps / s | 0.292 |
Epochs | 1 |
Raw metrics (for reproducibility)
{
"validation": {
"eval_loss": 2.228771209716797,
"eval_wer": 4.69732730414323,
"eval_intensity_rmse": 1.4492216110229492,
"eval_runtime": 1156.7567,
"eval_samples_per_second": 2.337,
"eval_steps_per_second": 0.292,
"epoch": 1.0
},
"training": {
"train_loss": 72.52319664163974,
"train_runtime": 6115.9656,
"train_samples_per_second": 4.666,
"train_steps_per_second": 0.292,
"epoch": 1.0
},
"test": {
"test_loss": 0.6630592346191406,
"test_wer": 4.69758064516129,
"test_intensity_rmse": 0.7333692312240601,
"test_runtime": 1129.2724,
"test_samples_per_second": 2.32,
"test_steps_per_second": 0.29,
"epoch": 1.0
}
}
Results
- Example logs and a sample checkpoint are referenced in the repository (`training-test-logs/` and the README link). Reproduce the numbers with the provided scripts in your environment.
Model Examination
- Inspect encoder activations or the regression head behavior across amplitude-normalized vs. unnormalized inputs to understand sensitivity to recording chain variations.
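One simple probe, sketched below under the assumption of a `predict_intensity(audio)` helper wrapping the trained model (hypothetical, not part of the repository API): scale a clip by a known gain and check whether the predicted loudness shifts by the expected amount (6 dB per doubling of amplitude for RMS dBFS, and approximately so for LUFS).

```python
import numpy as np

def gain_sensitivity_probe(audio: np.ndarray, predict_intensity, gain_db: float = 6.0):
    """Compare predicted loudness on a clip vs. a gain-scaled copy of the same clip.

    For RMS dBFS (and approximately for LUFS), the ground-truth target shifts by
    exactly gain_db, so the observed shift in predictions indicates how sensitive
    the regression head is to the overall level of the recording chain.
    """
    scaled = np.clip(audio * 10 ** (gain_db / 20.0), -1.0, 1.0)
    base_pred = predict_intensity(audio)
    scaled_pred = predict_intensity(scaled)
    return {"expected_shift_db": gain_db, "observed_shift_db": scaled_pred - base_pred}
```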
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator.
- Hardware Type: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Hours used: Not reported (varies by user setup and dataset size)
- Cloud Provider: N/A for local training; AWS SageMaker supported for cloud
Technical Specifications
Model Architecture and Objective
- Whisper encoder–decoder (transformer) for ASR with an additional regression head on top of the mean-pooled encoder representation. Objective is ASR CE loss + λ·MSE for intensity.
Compute Infrastructure
Hardware
- Validated on a single laptop GPU (RTX 3080 Ti Laptop). SageMaker training scripts included for cloud training.
Software
- Python, PyTorch, 🤗 Transformers/Datasets, `jiwer`, `pyloudnorm`, Gradio, and (optionally) Amazon SageMaker.
Citation
If you use this repository, please consider citing the underlying datasets and Whisper model.
BibTeX (Whisper):
@article{radford2022whisper,
title={Robust Speech Recognition via Large-Scale Weak Supervision},
author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal={arXiv preprint arXiv:2212.04356},
year={2022}
}
BibTeX (LibriSpeech):
@inproceedings{panayotov2015librispeech,
title={Librispeech: An {ASR} corpus based on public domain audio books},
author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={5206--5210},
year={2015},
organization={IEEE}
}
BibTeX (Common Voice):
@inproceedings{ardila2020common,
title={Common Voice: A Massively-Multilingual Speech Corpus},
author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M. and Weber, Gregor},
booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
pages={4218--4222},
year={2020}
}
Glossary
- ASR: Automatic Speech Recognition
- WER: Word Error Rate
- dBFS: Decibels relative to full scale (digital amplitude)
- LUFS: Loudness Units relative to Full Scale (per ITU-R BS.1770)
- Regression head: Small MLP predicting continuous loudness target
More Information
- For deployment, see `sagemaker/inference/` and `sagemaker/train/` for AWS SageMaker examples.
- For local testing and UI, see `app/app.py` (Gradio).
Model Card Authors
- Amirhossein Yousefi and contributors
Model Card Contact
- GitHub Issues on the repository