Model Card for speech2text-intensity-regression

This repository provides a multi-task Whisper-based model that performs automatic speech recognition (ASR) and voice intensity (loudness) regression in a single forward pass. A lightweight regression head is attached to Whisper’s encoder to predict loudness in dBFS (RMS) or LUFS (per ITU-R BS.1770).

Model Details

Model Description

  • Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
  • Shared by: Amirhossein Yousefi
  • Model type: Whisper encoder–decoder (ASR) with an additional regression head on the encoder for loudness prediction
  • Language(s) (NLP): English by default (LibriSpeech). Other languages are supported when the model is trained on Common Voice with the appropriate --language code.
  • License: MIT
  • Finetuned from model: openai/whisper-small (other Whisper sizes can be used via the --model_id argument).

What’s in the repo

  • End-to-end training and evaluation scripts (WER + intensity RMSE)
  • A simple baseline intensity regressor for comparison
  • A Gradio demo app for local inference
  • Dockerfile and Amazon SageMaker training/inference helpers

Model Sources

  • Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression

Uses

Direct Use

  • Transcribe short-form or long-form speech while simultaneously estimating voice loudness (RMS dBFS or LUFS) for analytics, QA, or normalization workflows.
  • Monitor audio level trends alongside transcript quality in call analytics, content moderation pipelines, or dataset curation.

Downstream Use

  • Fine-tune on domain- or language-specific data (e.g., Common Voice) to adapt both transcription and loudness estimation.
  • Integrate the model’s loudness head into larger prosody or audio-quality monitoring systems.

Out-of-Scope Use

  • Emotion/affect inference: Loudness is not a proxy for emotional intensity or arousal without appropriate labels and calibration.
  • Legal/compliance metering: LUFS/dBFS estimates depend on microphone gain, distance, codec, and environment; do not use as a calibrated sound level meter.
  • Speaker health/medical conclusions: Not designed or validated for clinical use.

Bias, Risks, and Limitations

  • ASR robustness can degrade for accents, noisy conditions, reverberant rooms, or domains far from training data.
  • Loudness predictions are input-chain dependent (mic gain, compression, codecs) and may not be comparable across devices without conditioning or calibration.
  • LUFS vs dBFS: LUFS better correlates with perceived loudness but depends on implementation details; dBFS (RMS) is simpler but less perceptual.

Recommendations

  • Calibrate and/or condition on known recording chains when comparing intensity across sessions or devices.
  • Prefer LUFS targets (--intensity_method lufs) for perceptual alignment; use RMS dBFS for simpler, robust estimates.
  • Evaluate on in-domain audio (compute WER and intensity RMSE) before deployment; consider domain adaptation via fine-tuning.

How to Get Started with the Model

Install (Python 3.10+; ensure a matching PyTorch+CUDA wheel if using GPU):

git clone https://github.com/amirhossein-yousefi/speech2text-intensity-regression
cd speech2text-intensity-regression
pip install -r requirements.txt

Train (example: LibriSpeech clean-100, Whisper-small):

python src/train_multitask_whisper.py \
  --model_id openai/whisper-small \
  --dataset librispeech --librispeech_config clean \
  --train_split train.100 --eval_split validation --test_split test \
  --language en --intensity_method rms \
  --epochs 3 --batch_size 8 --grad_accum 2 --lr 1e-5 --fp16 \
  --output_dir ./checkpoints/mtl_whisper_small

Evaluate on test:

python src/evaluate.py \
  --ckpt ./checkpoints/mtl_whisper_small \
  --dataset librispeech --language en --intensity_method rms

Run the local demo app:

CHECKPOINT=./checkpoints/mtl_whisper_small python app/app.py
# Open the printed Gradio URL; upload a .wav/.flac to see transcript + intensity

CLI baseline intensity regressor:

python src/baseline/baseline_intensity_regressor.py \
  --dataset librispeech --language en --intensity rms

Training Details

Training Data

  • LibriSpeech via 🤗 Datasets: openslr/librispeech_asr (use clean config; train.100, validation, test splits). Intensity targets are computed directly from audio (RMS dBFS or LUFS).
  • Common Voice 11.0 via 🤗 Datasets: mozilla-foundation/common_voice_11_0 (set --language, e.g., en, hi).

Note: For human-annotated arousal/intensity, you may adapt the code to datasets like MSP-Podcast or CREMA-D (ensure licensing).

Training Procedure

Preprocessing

  • Audio is resampled to 16 kHz, as required by the Whisper feature extractor.
  • Intensity computed per clip as RMS (dBFS) or LUFS (via pyloudnorm).
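
To make the RMS target concrete, here is a minimal sketch of a per-clip RMS level in dBFS. It assumes float audio scaled to [-1.0, 1.0] and a reference RMS of 1.0; the function name rms_dbfs and the eps floor are illustrative, and the repository's exact convention may differ (some tools add about 3.01 dB so that a full-scale sine reads 0 dBFS).

```python
import numpy as np


def rms_dbfs(waveform: np.ndarray, eps: float = 1e-12) -> float:
    """Return the RMS level of a clip in dBFS (illustrative helper).

    Assumes float samples in [-1.0, 1.0]; eps avoids log10(0) on silence.
    """
    samples = np.asarray(waveform, dtype=np.float64)
    rms = np.sqrt(np.mean(np.square(samples)))
    return float(20.0 * np.log10(max(rms, eps)))


# One second of a 440 Hz sine at half of full scale, sampled at 16 kHz.
sample_rate = 16_000
t = np.arange(sample_rate) / sample_rate
sine = 0.5 * np.sin(2.0 * np.pi * 440.0 * t)

# RMS of a sine with amplitude A is A / sqrt(2), so this is about -9.03 dBFS.
level = rms_dbfs(sine)
print(f"RMS level: {level:.2f} dBFS")
```

For LUFS targets, pyloudnorm provides a BS.1770 meter (pyloudnorm.Meter(rate).integrated_loudness(data)); it is left out of the sketch to keep it dependency-light.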

Objective

A small MLP regression head is attached to the mean-pooled encoder last hidden state. Training minimizes:

total_loss = asr_ce_loss + λ * mse(intensity)

λ is controlled by --lambda_intensity (default 1.0).
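
The objective above can be sketched in PyTorch as follows. This is an illustration, not the repository's code: the class name IntensityHead, the hidden width of 256, and the GELU activation are assumptions. The hidden size of 768 and the 1500 encoder frames match Whisper-small, and random tensors stand in for real encoder outputs and for the ASR cross-entropy loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class IntensityHead(nn.Module):
    """Hypothetical MLP head: mean-pool encoder states, regress one scalar."""

    def __init__(self, d_model: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, frames, d_model) -> pooled: (batch, d_model)
        pooled = encoder_states.mean(dim=1)
        return self.mlp(pooled).squeeze(-1)  # (batch,)


def multitask_loss(asr_ce_loss, intensity_pred, intensity_target,
                   lambda_intensity: float = 1.0):
    """total_loss = asr_ce_loss + lambda * mse(intensity)."""
    return asr_ce_loss + lambda_intensity * F.mse_loss(intensity_pred, intensity_target)


torch.manual_seed(0)
states = torch.randn(2, 1500, 768)      # stand-in for encoder last hidden state
head = IntensityHead()
pred = head(states)                     # predicted loudness per clip
target = torch.tensor([-23.0, -18.5])   # illustrative loudness targets
asr_ce = torch.tensor(1.25)             # placeholder for the decoder CE loss
loss = multitask_loss(asr_ce, pred, target, lambda_intensity=1.0)
print(pred.shape, loss.item())
```

Because the loudness targets are on a dB scale, the MSE term can be much larger than the cross-entropy early in training, which is one reason a weighting factor such as λ is useful.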

Training Hyperparameters

  • Example: epochs=3, batch_size=8, grad_accum=2, lr=1e-5, fp16=True (see README for more).
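
As a small worked example of how these settings combine, the sketch below computes the effective batch size per optimizer step, assuming a single GPU. It then compares that value with the ratio of samples per second to steps per second from the training summary reported in this card. The card does not state which flags produced the logged run, so the match is a consistency check rather than a confirmation.

```python
# Example hyperparameters from this section.
batch_size = 8
grad_accum = 2
num_devices = 1  # assumption: single-GPU training

# Samples consumed per optimizer step.
effective_batch = batch_size * grad_accum * num_devices

# Throughput figures copied from the training summary in this card.
train_samples_per_second = 4.666
train_steps_per_second = 0.292
implied_batch = train_samples_per_second / train_steps_per_second

print(f"effective batch size: {effective_batch}")
print(f"samples per step implied by the logs: {implied_batch:.2f}")
```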

Speeds, Sizes, Times

  • Base ASR backbone (example): openai/whisper-small (~244M parameters). Training time depends on hardware and dataset size.

Evaluation

Testing Data, Factors & Metrics

  • Testing Data: LibriSpeech test split or your in-domain test set
  • Factors: Noise conditions, microphones, languages, codecs
  • Metrics:
    • ASR: WER (via jiwer)
    • Intensity: RMSE in dBFS or LUFS
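
To show what the two metrics measure, here is a dependency-free sketch. The repository computes WER with jiwer; the word-level edit-distance function below is a stand-in written for illustration, and both function names are hypothetical. It returns WER as a fraction; the values of about 4.7 reported in this card are presumably percentages, which is an assumption.

```python
import math


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        # Levenshtein distance over words, keeping one DP row at a time.
        prev = list(range(len(hyp_words) + 1))
        for i, ref_word in enumerate(ref_words, start=1):
            curr = [i] + [0] * len(hyp_words)
            for j, hyp_word in enumerate(hyp_words, start=1):
                substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
                curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
            prev = curr
        total_edits += prev[-1]
        total_words += len(ref_words)
    return total_edits / total_words


def rmse(predictions, targets):
    """Root-mean-square error between predicted and target loudness values."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


refs = ["the cat sat on the mat", "hello world"]
hyps = ["the cat sat on mat", "hello there world"]
wer_value = word_error_rate(refs, hyps)             # 2 edits over 8 words
rmse_value = rmse([-20.0, -25.0], [-21.0, -23.0])   # sqrt((1 + 4) / 2)
print(f"WER: {100 * wer_value:.2f}%  intensity RMSE: {rmse_value:.4f} dB")
```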

📊 Results & Metrics

🔎 Highlights

  • Test WER (↓): 4.6976
  • Test Intensity RMSE (↓): 0.7334
  • Validation WER (↓): 4.6973
  • Validation Intensity RMSE (↓): 1.4492

Lower is better (↓). WER computed with jiwer. Intensity RMSE is the regression error on the loudness target (RMS dBFS by default, or LUFS if --intensity_method lufs is used).


✅ Full Metrics

Validation (Dev)

Metric                Value
Loss                  2.2288
WER (↓)               4.6973
Intensity RMSE (↓)    1.4492
Runtime (s)           1,156.757 (≈ 19m 17s)
Samples / s           2.337
Steps / s             0.292
Epoch                 1

Test

Metric                Value
Loss                  0.6631
WER (↓)               4.6976
Intensity RMSE (↓)    0.7334
Runtime (s)           1,129.272 (≈ 18m 49s)
Samples / s           2.320
Steps / s             0.290
Epoch                 1

Training Summary

Metric                Value
Train Loss            72.5232
Runtime (s)           6,115.966 (≈ 1h 41m 56s)
Samples / s           4.666
Steps / s             0.292
Epochs                1

Raw metrics (for reproducibility)
{
  "validation": {
    "eval_loss": 2.228771209716797,
    "eval_wer": 4.69732730414323,
    "eval_intensity_rmse": 1.4492216110229492,
    "eval_runtime": 1156.7567,
    "eval_samples_per_second": 2.337,
    "eval_steps_per_second": 0.292,
    "epoch": 1.0
  },
  "training": {
    "train_loss": 72.52319664163974,
    "train_runtime": 6115.9656,
    "train_samples_per_second": 4.666,
    "train_steps_per_second": 0.292,
    "epoch": 1.0
  },
  "test": {
    "test_loss": 0.6630592346191406,
    "test_wer": 4.69758064516129,
    "test_intensity_rmse": 0.7333692312240601,
    "test_runtime": 1129.2724,
    "test_samples_per_second": 2.32,
    "test_steps_per_second": 0.29,
    "epoch": 1.0
  }
}

Results

  • Example logs and a sample checkpoint are referenced in the repository (training-test-logs/ and the README link). Reproduce numbers with the provided scripts for your environment.

Model Examination

  • Inspect encoder activations or the regression head behavior across amplitude-normalized vs. unnormalized inputs to understand sensitivity to recording chain variations.
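
One way to see why the recording chain matters is to check how gain affects the RMS target itself. The sketch below uses synthetic noise and illustrative helper names, not the repository's code. It shows that scaling a clip by g shifts its RMS level by 20·log10(g) dB, while peak normalization removes that shift entirely. A model fed amplitude-normalized audio therefore cannot recover absolute level from gain alone.

```python
import numpy as np


def rms_dbfs(waveform: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level in dBFS for float audio (illustrative helper)."""
    rms = np.sqrt(np.mean(np.square(waveform.astype(np.float64))))
    return float(20.0 * np.log10(max(rms, eps)))


def peak_normalize(waveform: np.ndarray) -> np.ndarray:
    """Scale the clip so its largest absolute sample equals 1.0."""
    return waveform / np.max(np.abs(waveform))


rng = np.random.default_rng(0)
clip = 0.1 * rng.standard_normal(16_000)  # synthetic stand-in for 1 s of audio

shifts = {}             # level change caused by each gain, in dB
normalized_levels = {}  # level after peak normalization, in dBFS
for gain_db in (-6.0, 0.0, 6.0):
    gain = 10.0 ** (gain_db / 20.0)
    scaled = gain * clip
    shifts[gain_db] = rms_dbfs(scaled) - rms_dbfs(clip)
    normalized_levels[gain_db] = rms_dbfs(peak_normalize(scaled))

for gain_db in shifts:
    print(f"gain {gain_db:+.1f} dB -> shift {shifts[gain_db]:+.2f} dB, "
          f"after peak normalization {normalized_levels[gain_db]:.2f} dBFS")
```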

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator.

  • Hardware Type: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
  • Hours used: Not reported in general (varies by user setup and dataset size); the single-epoch run logged in the Training Summary took about 1.7 hours of training.
  • Cloud Provider: N/A for local training; AWS SageMaker supported for cloud

Technical Specifications

Model Architecture and Objective

  • Whisper encoder–decoder (transformer) for ASR with an additional regression head on top of the mean-pooled encoder representation. Objective is ASR CE loss + λ·MSE for intensity.

Compute Infrastructure

Hardware

  • Validated on a single laptop GPU (RTX 3080 Ti Laptop). SageMaker training scripts included for cloud training.

Software

  • Python, PyTorch, 🤗 Transformers/Datasets, jiwer, pyloudnorm, Gradio, (optional) Amazon SageMaker.

Citation

If you use this repository, please consider citing the underlying datasets and Whisper model.

BibTeX (Whisper):

@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}

BibTeX (LibriSpeech):

@inproceedings{panayotov2015librispeech,
  title={Librispeech: An {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}

BibTeX (Common Voice):

@inproceedings{ardila2020common,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M. and Weber, Gregor},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4218--4222},
  year={2020}
}

Glossary

  • ASR: Automatic Speech Recognition
  • WER: Word Error Rate
  • dBFS: Decibels relative to full scale (digital amplitude)
  • LUFS: Loudness Units relative to Full Scale (per ITU-R BS.1770)
  • Regression head: Small MLP predicting continuous loudness target

More Information

  • For deployment, see sagemaker/inference/ and sagemaker/train/ for AWS SageMaker examples.
  • For local testing and UI, see app/app.py (Gradio).

Model Card Authors

  • Amirhossein Yousefi and contributors

Model Card Contact

  • GitHub Issues on the repository