Model Card for ASR (CTC‑based English ASR)

This repository contains an end‑to‑end Automatic Speech Recognition (ASR) pipeline built around Hugging Face Transformers. The default configuration fine‑tunes facebook/wav2vec2-base-960h with a CTC head on a 50k‑utterance subsample of Common Voice 17.0 (English) and provides scripts to train, evaluate, export to ONNX, and deploy on AWS SageMaker. It also includes a robust audio loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

Model Details

Model Description

  • Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
  • Funded by: Not specified
  • Shared by: Amirhossein Yousefi
  • Model type: CTC-based ASR using Transformers (Wav2Vec2ForCTC)
  • Language(s): English (en)
  • License: Base model is Apache-2.0; repository/fine-tuned weights license not explicitly stated here (treat as other until clarified)
  • Finetuned from model: facebook/wav2vec2-base-960h

The training/evaluation pipeline uses Hugging Face transformers, datasets, and jiwer and includes scripts for inference and SageMaker deployment.

Model Sources

  • Repository: https://github.com/amirhossein-yousefi/ASR
  • Paper: Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
  • Demo: N/A (local CLI and SageMaker examples included)

Uses

Direct Use

  • General‑purpose English speech transcription for short to moderately long audio segments (default duration filter: ~1–18 seconds).
  • Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types).
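
For the SageMaker path, the snippet below is a minimal sketch of a JSON/base64 invocation with boto3. The endpoint name and the JSON field name ("audio_b64") are placeholders, not the repository's actual contract; check the handlers in sagemaker/ for the exact payload format.

import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Encode a local WAV file as base64 and wrap it in a JSON payload.
with open("path/to/file.wav", "rb") as f:
    payload = {"audio_b64": base64.b64encode(f.read()).decode("utf-8")}

response = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",        # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))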

Downstream Use

  • Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
  • Export to ONNX for CPU‑friendly inference and integration in production applications.
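
As an illustration of the ONNX path, the sketch below runs CPU inference with onnxruntime. The file name "model.onnx" is a placeholder, and the exported graph is assumed to take a single audio input and return CTC logits; the input name is read from the session rather than hard‑coded.

import numpy as np
import onnxruntime as ort
import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:  # downmix multi-channel audio to mono
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="np")
input_name = session.get_inputs()[0].name   # avoid hard-coding the exported input name
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])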

Out-of-Scope Use

  • Speaker diarization, punctuation restoration, and true streaming ASR are not included.
  • Multilingual or code‑switched speech without additional fine‑tuning.
  • Very long files without chunking; heavy background noise without augmentation/tuning.
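
For long recordings, a simple mitigation is to transcribe fixed‑size windows and join the text. The sketch below (reusing a model, processor, and device loaded as shown under "How to Get Started with the Model" below) uses a 15‑second window with no overlap; it is illustrative only, since words at chunk boundaries can be split or garbled.

import torch
import torchaudio

def transcribe_long(path, model, processor, device, chunk_s=15.0):
    wav, sr = torchaudio.load(path)
    target_sr = processor.feature_extractor.sampling_rate
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    audio = wav.mean(dim=0)                      # downmix to mono
    step = int(chunk_s * target_sr)
    texts = []
    for start in range(0, audio.numel(), step):
        chunk = audio[start:start + step]
        if chunk.numel() < 1600:                 # skip fragments shorter than ~0.1 s
            continue
        inputs = processor(chunk.numpy(), sampling_rate=target_sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values.to(device)).logits
        ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(ids.cpu().numpy())[0])
    return " ".join(t for t in texts if t)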

Bias, Risks, and Limitations

  • The default fine‑tuning dataset (Common Voice 17.0, English) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
  • Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.

Recommendations

  • Always evaluate WER/CER on your own hold‑out data; a minimal scoring snippet follows this list. Consider adding punctuation/casing restoration models and domain vocabularies as needed.
  • For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.
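
A minimal scoring sketch with jiwer, assuming you already have reference and hypothesis strings for your hold‑out set; apply the same text normalization to both sides that you use during training and evaluation.

import jiwer

references = ["the quick brown fox", "hello world"]   # your hold-out transcripts
hypotheses = ["the quick brown fox", "hello word"]    # model outputs for the same audio

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")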

How to Get Started with the Model

Python (local inference):

import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:  # downmix multi-channel audio to mono
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:  # Wav2Vec2 expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])

CLI (example):

python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav

Training Details

Training Data

  • Dataset: Common Voice 17.0 (English), text column: sentence
  • Duration filter: min ~1.0s, max ~18.0s
  • Notes: Case‑aware normalization, whitelist filtering to match tokenizer vocabulary; optional waveform augmentations.
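
A hedged sketch of this data setup using datasets; the exact split slicing, authentication handling, and filter implementation in the repository's scripts may differ. Common Voice on the Hub is gated, so you must accept its terms and authenticate first.

from datasets import Audio, load_dataset

# Roughly a 50k-row English subsample; newer datasets versions may also require
# trust_remote_code=True and an auth token for this dataset.
ds = load_dataset("mozilla-foundation/common_voice_17_0", "en", split="train[:50000]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def in_duration_range(example, min_s=1.0, max_s=18.0):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return min_s <= duration <= max_s

ds = ds.filter(in_duration_range)   # the "sentence" column holds the reference text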

Training Procedure

Preprocessing

  • Robust audio decoding (FFmpeg preferred on Windows; fallback to torchaudio/soundfile/librosa), resampling to 16 kHz as required by Wav2Vec2.
  • Tokenization via the model’s processor; dynamic padding with a CTC collator.
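
The collator referenced above follows the usual Wav2Vec2/CTC pattern: pad audio and labels separately, then mask label padding with -100 so the loss ignores it. This is a minimal sketch; the repository's implementation may differ in detail.

from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        audio_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.feature_extractor.pad(
            audio_features, padding=True, return_tensors="pt"
        )
        labels_batch = self.processor.tokenizer.pad(
            label_features, padding=True, return_tensors="pt"
        )
        # Replace label padding with -100 so the CTC loss ignores those positions.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch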

Training Hyperparameters

  • Epochs: 3
  • Per‑device batch size: 8 (× 8 grad accumulation → effective 64)
  • Learning rate: 3e‑5
  • Warmup ratio: 0.05
  • Optimizer: adamw_torch_fused
  • Weight decay: 0.0
  • Precision: FP16
  • Max grad norm: 1.0
  • Logging: every 50 steps; Eval/Save: every 500 steps; keep last 2 checkpoints; early stopping patience = 3
  • Seed: 42
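
For orientation, the settings above map roughly to the following TrainingArguments; the repository's training script is authoritative. On transformers versions before 4.41 the eval_strategy argument is spelled evaluation_strategy, and the early‑stopping patience of 3 would be added via EarlyStoppingCallback.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,        # needed for early stopping on eval WER
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)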

Speeds, Sizes, Times

  • Total training FLOPs: ~1.08 × 10¹⁹ (logged as 10,814,747,992,293,114,000)
  • Training runtime: ~11,168 s for 2,346 steps
  • Logs: TensorBoard at src/output/logs (or similar path as configured)

Evaluation

Testing Data, Factors & Metrics

  • Metrics: WER (primary) and CER (auxiliary), computed with jiwer utilities (see the sketch after this list).
  • Factors: English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length.
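
A Trainer-style compute_metrics along these lines reproduces the WER/CER computation; the repository may apply additional text normalization before scoring.

import numpy as np
import jiwer

def make_compute_metrics(processor):
    def compute_metrics(pred):
        pred_ids = np.argmax(pred.predictions, axis=-1)
        label_ids = pred.label_ids.copy()
        # Restore padded label positions (-100) to the pad id before decoding.
        label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
        hypotheses = processor.batch_decode(pred_ids)
        references = processor.batch_decode(label_ids, group_tokens=False)
        return {"wer": jiwer.wer(references, hypotheses),
                "cer": jiwer.cer(references, hypotheses)}
    return compute_metrics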

Results

  • Training logs include loss, eval WER, and eval CER curves; see the assets/ directory for the corresponding plots.

Summary

  • Baseline WER/CER are logged at each evaluation step; users should report domain‑specific results on their own datasets.

Model Examination

  • Greedy decoding by default; beam search/LM fusion is not included in this repo. Inspect logits and alignments if needed for error analysis.
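
For quick error analysis, a sketch like the following inspects per-frame greedy tokens and their probabilities. It reuses model, processor, inputs, and device from the inference example above; for Wav2Vec2 the pad token doubles as the CTC blank.

import torch

with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits[0]   # (frames, vocab)
probs = torch.softmax(logits, dim=-1)
frame_ids = probs.argmax(dim=-1)
frame_conf = probs.max(dim=-1).values

tokens = processor.tokenizer.convert_ids_to_tokens(frame_ids.tolist())
for t, (token, conf) in enumerate(zip(tokens, frame_conf.tolist())):
    if token != processor.tokenizer.pad_token:   # skip CTC blank frames
        print(f"frame {t:4d}  token {token!r}  p={conf:.2f}")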

Environmental Impact

  • Hardware Type: Laptop (Windows)
  • GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
  • CUDA / PyTorch: CUDA 12.9, PyTorch 2.8.0+cu129
  • Hours used: ~3.1 h
  • Cloud Provider: N/A for local; AWS SageMaker utilities available for cloud training/deployment
  • Compute Region: N/A (local)
  • Carbon Emitted: Not calculated; can be estimated with the MLCO2 Machine Learning Impact calculator (https://mlco2.github.io/impact)

Technical Specifications

Model Architecture and Objective

  • Architecture: Wav2Vec2 encoder with a CTC output head
  • Parameters: ~94.4M (stored as F32 safetensors)
  • Objective: Character‑level CTC loss for ASR
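
In practice the objective is computed by the model itself: passing character-level labels to Wav2Vec2ForCTC makes the forward pass return the CTC loss. A self-contained sketch with dummy audio:

import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16_000)   # one second of dummy 16 kHz audio
inputs = processor(audio.numpy(), sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss   # CTC loss over character targets
print(loss.item())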

Compute Infrastructure

Hardware

  • Local GPU as above; or AWS instance types via SageMaker scripts (e.g., ml.g4dn.xlarge).

Software

  • Python 3.10+
  • Key dependencies: transformers, datasets, torch, torchaudio, soundfile, librosa, jiwer, onnxruntime (for ONNX testing), and boto3/sagemaker for deployment.

Citation

BibTeX:

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}

APA: Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self‑supervised learning of speech representations. arXiv:2006.11477.

Glossary

  • WER: Word Error Rate; lower is better.
  • CER: Character Error Rate; lower is better.
  • CTC: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.

More Information

  • ONNX export: src/export_onnx.py (an illustrative export sketch follows this list)
  • AWS SageMaker: scripts in sagemaker/ for training, deployment, and autoscaling.
  • Training/metrics plots: see assets/ (e.g., train_loss.svg, eval_wer.svg, eval_cer.svg).
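
As an illustration of the export step (not a copy of src/export_onnx.py), a plain torch.onnx.export of the fine-tuned checkpoint with dynamic batch/time axes looks roughly like this:

import torch
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("./outputs/asr").eval()
dummy = torch.randn(1, 16_000)   # one second of dummy 16 kHz audio

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch", 1: "time"},
        "logits": {0: "batch", 1: "frames"},
    },
    opset_version=17,
)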

Model Card Authors

  • Amirhossein Yousefi (repo author)

Model Card Contact

  • Amirhossein Yousefi (GitHub: @amirhossein-yousefi); open an issue on the repository: https://github.com/amirhossein-yousefi/ASR