---
library_name: transformers
pipeline_tag: automatic-speech-recognition
tags:
- speech
- asr
- audio-regression
- multitask-learning
- wav2vec2
- whisper
- gradio
- sagemaker
datasets:
- librispeech_asr
- mozilla-foundation/common_voice_13_0
base_model:
- facebook/wav2vec2-base-960h
license: mit
language: en
---

# Model Card for `amirhossein-yousefi/speech2text-intensity-regression-wav2vec`

**Summary:** End-to-end speech model that jointly performs **automatic speech recognition (ASR)** and **voice intensity regression** from the same input audio. Backbone: **Wav2Vec2‑CTC** with a regression head.

## Model Details

### Model Description

- **Developed by:** Amirhossein Yousefi
- **Model type:** Multitask speech model (ASR + scalar intensity regression)
- **Architecture:** `facebook/wav2vec2-base-960h` (CTC) + attention‑masked mean‑pooling regressor (sketched in code below)
- **Language(s):** English (depends on chosen dataset/splits)
- **License:** MIT
- **Finetuned from:** `facebook/wav2vec2-base-960h`

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec
- **Demo:** Gradio script in `app/gradio_app.py`

## Uses

### Direct Use

- Transcribe English speech to text (ASR) and simultaneously estimate **normalized intensity** for the same audio clip.
- Interactive inference via CLI or Gradio.

### Downstream Use

- Domain‑specific fine‑tuning for ASR while keeping the intensity head.
- Use intensity as an auxiliary signal for VAD thresholds, diarization heuristics, or UX analytics.

### Out‑of‑Scope Use

- Safety‑critical applications without human review.
- Treating the intensity output as perceptual loudness or emotion/affect; it is **RMS dBFS‑derived** and sensitive to mic gain and environment.

## Bias, Risks, and Limitations

- **Dataset bias:** Default training on LibriSpeech (read audiobooks) may not reflect conversational or accented speech.
- **Device & environment sensitivity:** Intensity depends on microphone, distance, and preprocessing.
- **Domain shift:** Degradation is expected on far‑field, noisy, or multilingual inputs without adaptation.

### Recommendations

- Calibrate or post‑normalize intensity for your capture setup.
- Report WER and regression errors by domain (mic type, SNR buckets, etc.).
- Keep a human in the loop for sensitive deployments.
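For orientation before getting started, here is a minimal PyTorch sketch of the multitask architecture described above: a `facebook/wav2vec2-base-960h` CTC backbone whose frame embeddings are mean‑pooled under the attention mask and fed to a small regression head. Class and layer names are illustrative and are **not** the repository's actual code.

```python
# Minimal sketch (illustrative, not the repository's implementation):
# Wav2Vec2 CTC backbone + attention-masked mean-pooled intensity regressor.
import torch.nn as nn
from transformers import Wav2Vec2ForCTC


class Wav2Vec2WithIntensity(nn.Module):  # hypothetical class name
    def __init__(self, model_name: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.ctc = Wav2Vec2ForCTC.from_pretrained(model_name)
        hidden = self.ctc.config.hidden_size
        self.intensity_head = nn.Sequential(
            nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 1)
        )

    def forward(self, input_values, attention_mask=None):
        encoder_out = self.ctc.wav2vec2(input_values, attention_mask=attention_mask)
        hidden_states = encoder_out.last_hidden_state  # (batch, frames, hidden)

        # CTC logits for ASR, reusing the backbone's dropout and LM head.
        logits = self.ctc.lm_head(self.ctc.dropout(hidden_states))

        # Attention-masked mean pooling over frames for the intensity head.
        if attention_mask is not None:
            # Private transformers helper that maps the sample-level mask to
            # the encoder frame rate; used here only for illustration.
            frame_mask = self.ctc._get_feature_vector_attention_mask(
                hidden_states.shape[1], attention_mask
            ).unsqueeze(-1).to(hidden_states.dtype)
            pooled = (hidden_states * frame_mask).sum(1) / frame_mask.sum(1).clamp(min=1.0)
        else:
            pooled = hidden_states.mean(dim=1)

        intensity = self.intensity_head(pooled).squeeze(-1)  # targets live in [0, 1]
        return logits, intensity
```

The training scripts below produce the actual fine‑tuned checkpoints (e.g. `outputs/wav2vec2_base_mtl`); their exact head layout and class names may differ from this sketch.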
## How to Get Started with the Model

### Environment

```bash
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

### Train (Whisper backbone)

```bash
python -m src.speech_mtl.training.train_whisper \
  --model_name openai/whisper-small \
  --language en \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --num_train_epochs 1 \
  --output_dir outputs/whisper_small_mtl
```

### Train (Wav2Vec2‑CTC backbone)

```bash
python -m src.speech_mtl.training.train_wav2vec2 \
  --model_name facebook/wav2vec2-base-960h \
  --dataset librispeech_asr \
  --train_split train.clean.100 \
  --eval_split validation.clean \
  --text_column text \
  --max_train_samples 1000 \
  --max_eval_samples 150 \
  --num_train_epochs 1 \
  --output_dir outputs/wav2vec2_base_mtl
```

### Evaluate

```bash
python -m src.speech_mtl.eval.evaluate \
  --whisper_model_dir outputs/whisper_small_mtl \
  --wav2vec2_model_dir outputs/wav2vec2_base_mtl \
  --dataset librispeech_asr \
  --split test.clean \
  --text_column text
```

### Inference (CLI)

```bash
python -m src.speech_mtl.inference.predict \
  --model whisper \
  --checkpoint outputs/whisper_small_mtl \
  --audio path/to/audio.wav
```

### Gradio Demo

```bash
python app/gradio_app.py --model whisper --checkpoint outputs/whisper_small_mtl
# or
python app/gradio_app.py --model wav2vec2 --checkpoint outputs/wav2vec2_base_mtl
```

## Training Details

### Training Data

- **Default:** `librispeech_asr` (`train.clean.100`; eval on `validation.clean` / `test.clean`).
- **Optional:** `mozilla-foundation/common_voice_13_0` via `--dataset` and `--language`.

**Intensity targets:** computed from audio RMS dBFS, clipped to `[-60, 0]` and normalized to `[0, 1]`:

```text
norm_intensity = (dbfs + 60) / 60   # dbfs clipped to [-60, 0]  ->  norm_intensity in [0, 1]
```

### Training Procedure

#### Preprocessing

- Load audio and resample to 16 kHz, as required by the backbones.
- Compute intensity labels from the raw audio; LUFS (via `pyloudnorm`) can be used as an alternative.

#### Training Hyperparameters

- **Training regime:** fp16 mixed precision when available; batch size and learning rate configured via `configs/*.yaml`.

#### Speeds, Sizes, Times

- Example single‑epoch fine‑tuned weights are linked in the repo README (`training-logs/` contains logs).

## Evaluation

### Testing Data, Factors & Metrics

- **Testing Data:** LibriSpeech `test.clean` by default; optionally Common Voice.
- **Factors:** noise level, microphone/domain, utterance length.
- **Metrics:**
  - **ASR:** Word Error Rate (WER)
  - **Intensity regression:** MAE, MSE, and R²

### Results

## 📊 Training Logs & Metrics

- **Total FLOPs (training):** `11,971,980,681,992,470,000`
- **Training runtime:** `9,579.8516` seconds for 3 epochs
- **Logging:** TensorBoard‑compatible logs in `src/checkpoint/logs`; you can monitor training live with `tensorboard --logdir src/checkpoint/logs`.

## ✅ Full Metrics

### 🔎 Highlights

- **Validation WER (↓):** **12.897%** _(0.128966 as a fraction)_
- **Validation Loss:** **21.7842**
- Fast eval throughput: **17.05 samples/s** • **4.264 steps/s**

> **WER** from `jiwer.wer` (a fraction in [0, 1]; percent shown for readability).
> This run uses a **CTC** objective for ASR plus an auxiliary **intensity** head (multi‑task), but only ASR metrics were logged during evaluation.
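For reference, the fractional WER above is what the `jiwer` package returns; a tiny self‑contained example (the strings are illustrative, not taken from the evaluation set):

```python
# Illustrative only: jiwer reports WER as a fraction in [0, 1].
from jiwer import wer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jump over a lazy dog"

error = wer(reference, hypothesis)   # 2 word errors / 9 reference words ≈ 0.2222
print(f"WER: {error:.4f} ({error * 100:.2f}%)")
```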
#### Validation (Dev)

| Metric | Value |
|---|---|
| **Loss** | **21.7842** |
| **WER (↓)** | **0.128966** _(12.897%)_ |
| **Runtime (s)** | **158.5324** _(≈ 2m 39s)_ |
| **Samples / s** | **17.050** |
| **Steps / s** | **4.264** |
| **Epoch** | **2.8** |

#### Training Summary

| Metric | Value |
|---|---|
| **Train Loss** | **227.4951** |
| **Runtime (s)** | **9,579.8514** _(≈ 2h 39m 40s)_ |
| **Samples / s** | **8.937** |
| **Steps / s** | **0.559** |
| **Epochs** | **3.0** |

---

#### Summary

The multitask objective is the ASR (CTC) loss plus the intensity regression loss, with the weight of the regression term controlled by `--lambda_intensity` (a schematic sketch is given at the end of this card).

## Model Examination

Inspect encoder representations or saliency maps to see which frames contribute most to the intensity prediction.

## Environmental Impact

- **Hardware Type:** Laptop GPU
- **GPU:** NVIDIA GeForce RTX 3080 Ti Laptop (16 GB VRAM)

## Technical Specifications

### Model Architecture and Objective

- **Wav2Vec2‑CTC variant:** Transformer encoder with a CTC head for ASR plus an attention‑masked mean‑pooled regressor for intensity.

### Compute Infrastructure

- **Hardware:** Laptop with NVIDIA RTX 3080 Ti (16 GB).
- **Software:** Python, PyTorch, Hugging Face `transformers`/`datasets`, Gradio.

## Citation

If you build on this work, please cite the repository.

**BibTeX:**

```bibtex
@misc{yousefi2025speechmtl,
  title        = {Speech Multitask End-to-End (ASR + Intensity Regression)},
  author       = {Yousefi, Amirhossein},
  year         = {2025},
  howpublished = {GitHub repository},
  url          = {https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec}
}
```

**APA:**

Yousefi, A. (2025). *Speech Multitask End‑to‑End (ASR + Intensity Regression)* [Computer software]. GitHub. https://github.com/amirhossein-yousefi/speech2text-intensity-regression-wav2vec

## More Information

- Configs: `configs/wav2vec2_base.yaml`
- Deployment: Amazon SageMaker packaging/inference under `sagemaker/`

## Model Card Contact

Please open an issue in the GitHub repository.
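As referenced in the Summary above, training combines the CTC loss for ASR with a weighted regression loss for intensity. The snippet below is a minimal sketch only: it assumes an MSE regression term (MSE is one of the reported regression metrics), `lambda_intensity` mirrors the `--lambda_intensity` flag, and the other names are illustrative rather than taken from the repository.

```python
# Schematic multitask objective: CTC loss (ASR) + lambda * MSE (intensity).
# Illustrative only; the repository's loss implementation may differ in detail.
import torch.nn.functional as F


def multitask_loss(log_probs, targets, input_lengths, target_lengths,
                   intensity_pred, intensity_target, lambda_intensity):
    # log_probs: (time, batch, vocab) log-probabilities, as expected by F.ctc_loss
    asr_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                          blank=0, zero_infinity=True)
    intensity_loss = F.mse_loss(intensity_pred, intensity_target)  # targets in [0, 1]
    return asr_loss + lambda_intensity * intensity_loss
```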