Model Card for ASR (CTC-based ASR on English)
This repository contains an end‑to‑end Automatic Speech Recognition (ASR) pipeline built around Hugging Face Transformers. The default configuration fine‑tunes facebook/wav2vec2-base-960h
with a CTC head on a 50k subsample of Common Voice 17.0 (English) and provides scripts to train, evaluate, export to ONNX, and deploy on AWS SageMaker. It also includes a robust audio‑loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
- Funded by: Not specified
- Shared by: Amirhossein Yousefi
- Model type: CTC-based ASR using Transformers (Wav2Vec2ForCTC)
- Language(s) (NLP): English (en)
- License: Base model is Apache-2.0; the license for this repository/fine-tuned weights is not explicitly stated here (treat as "other" until clarified)
- Finetuned from model: facebook/wav2vec2-base-960h
The training/evaluation pipeline uses Hugging Face transformers, datasets, and jiwer, and includes scripts for inference and SageMaker deployment.
Model Sources
- Repository: https://github.com/amirhossein-yousefi/ASR
- Paper : Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
- Demo : N/A (local CLI and SageMaker examples included)
Uses
Direct Use
- General‑purpose English speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types).
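For the SageMaker path, a minimal invocation sketch from Python is shown below. The endpoint name asr-endpoint and the JSON field audio_b64 are illustrative assumptions; the real payload contract is defined by the inference handler in the sagemaker/ scripts:

# Sketch: call a deployed SageMaker endpoint (endpoint name and JSON schema are assumptions)
import base64
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Variant 1: JSON body with a base64-encoded WAV (field name "audio_b64" is assumed)
with open("path/to/file.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")
response = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",            # hypothetical endpoint name
    ContentType="application/json",
    Body=json.dumps({"audio_b64": audio_b64}),
)
print(response["Body"].read().decode("utf-8"))

# Variant 2: raw WAV bytes (content type mentioned in this card)
with open("path/to/file.wav", "rb") as f:
    response = runtime.invoke_endpoint(
        EndpointName="asr-endpoint",
        ContentType="audio/wav",
        Body=f.read(),
    )
print(response["Body"].read().decode("utf-8"))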
Downstream Use
- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
- Export to ONNX for CPU‑friendly inference and integration in production applications.
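As a sketch of the ONNX path, the snippet below runs CPU inference with onnxruntime; the model.onnx path and the input_values/logits tensor names are assumptions and should match whatever src/export_onnx.py actually produces:

# Sketch: CPU inference on an exported ONNX model (file path and tensor names are assumptions)
import numpy as np
import onnxruntime as ort
import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
sess = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])

wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="np")
logits = sess.run(None, {"input_values": inputs["input_values"].astype(np.float32)})[0]
pred_ids = logits.argmax(axis=-1)
print(processor.batch_decode(pred_ids)[0])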
Out-of-Scope Use
- Speaker diarization, punctuation restoration, and true streaming ASR are not included.
- Multilingual or code‑switched speech without additional fine‑tuning.
- Very long files without chunking; heavy background noise without augmentation/tuning.
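For the long-file limitation above, a minimal fixed-window chunking sketch is shown below. The 30 s window is an arbitrary choice (overlapping windows with merging would be more robust), and model, processor, and device are assumed to be loaded as in the "How to Get Started" example further down:

# Sketch: transcribe a long file in fixed-size chunks (window length is an arbitrary choice)
import torch
import torchaudio

def transcribe_long(path, model, processor, device, chunk_s=30.0):
    wav, sr = torchaudio.load(path)
    target_sr = processor.feature_extractor.sampling_rate
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    wav = wav.mean(dim=0)  # downmix to mono
    step = int(chunk_s * target_sr)
    texts = []
    for start in range(0, wav.numel(), step):
        chunk = wav[start:start + step]
        inputs = processor(chunk.numpy(), sampling_rate=target_sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs["input_values"].to(device)).logits
        texts.append(processor.batch_decode(logits.argmax(dim=-1).cpu().numpy())[0])
    return " ".join(texts)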
Bias, Risks, and Limitations
- The default fine‑tuning dataset (Common Voice 17.0, English) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
- Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.
Recommendations
- Always evaluate WER/CER on your own hold‑out data. Consider adding punctuation casing models and domain vocabularies as needed.
- For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.
How to Get Started with the Model
Python (local inference):
import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face Hub id
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

# Load audio, downmix to mono, and resample to the model's expected rate (16 kHz)
wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits

# Greedy CTC decoding
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
CLI (example):
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
Training Details
Training Data
- Dataset: Common Voice 17.0 (English), text column:
sentence
- Duration filter: min ~1.0s, max ~18.0s
- Notes: Case‑aware normalization, whitelist filtering to match tokenizer vocabulary; optional waveform augmentations.
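A minimal sketch of that duration filter on a datasets split (the 50k slice and column names follow this card; loading Common Voice from the Hub requires accepting the dataset's terms and authenticating):

# Sketch: apply the ~1–18 s duration filter to a Common Voice subsample (thresholds from this card)
from datasets import load_dataset, Audio

ds = load_dataset("mozilla-foundation/common_voice_17_0", "en", split="train[:50000]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def keep(example, min_s=1.0, max_s=18.0):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return min_s <= duration <= max_s

ds = ds.filter(keep)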
Training Procedure
Preprocessing
- Robust audio decoding (FFmpeg preferred on Windows; fallbacks to torchaudio/soundfile/librosa), with resampling to 16 kHz as required by Wav2Vec2 (a simplified sketch follows this list).
- Tokenization via the model's processor; dynamic padding with a CTC collator.
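A simplified sketch of the fallback idea referenced in the first bullet (the repository's loader prefers FFmpeg and handles more cases; this version only tries soundfile, then librosa, and resamples to 16 kHz):

# Sketch: load audio with a soundfile -> librosa fallback and resample to 16 kHz (simplified)
import numpy as np
import soundfile as sf
import librosa

def load_audio(path: str, target_sr: int = 16_000) -> np.ndarray:
    try:
        wav, sr = sf.read(path, dtype="float32", always_2d=False)
        if wav.ndim > 1:
            wav = wav.mean(axis=1)  # downmix to mono
        if sr != target_sr:
            wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    except Exception:
        # librosa falls back to audioread/FFmpeg for formats soundfile cannot decode
        wav, _ = librosa.load(path, sr=target_sr, mono=True)
    return wav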
Training Hyperparameters
- Epochs: 3
- Per‑device batch size: 8 (× 8 grad accumulation → effective 64)
- Learning rate: 3e‑5
- Warmup ratio: 0.05
- Optimizer: adamw_torch_fused
- Weight decay: 0.0
- Precision: FP16
- Max grad norm: 1.0
- Logging: every 50 steps; Eval/Save: every 500 steps; keep last 2 checkpoints; early stopping patience = 3
- Seed: 42
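A hedged sketch of how these hyperparameters map onto transformers.TrainingArguments (the repository's training script is the source of truth; argument names such as eval_strategy vs. evaluation_strategy vary across transformers versions):

# Sketch: the hyperparameters above expressed as TrainingArguments (check the training script for the real config)
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",              # "evaluation_strategy" on older transformers releases
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,        # needed for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)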
Speeds, Sizes, Times
- Total FLOPs (training): 10,814,747,992,293,114,000
- Training runtime: ~11,168 s for 2,346 steps
- Logs: TensorBoard logs at src/output/logs (or a similar path, as configured)
Evaluation
Testing Data, Factors & Metrics
- Metrics: WER (primary) and CER (auxiliary), computed with jiwer utilities (see the snippet below).
- Factors: English speech across Common Voice 17.0 splits; performance varies by accent, recording conditions, and utterance length.
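As noted in the Metrics bullet, WER/CER come from jiwer; a minimal reproduction sketch (omitting the repository's text normalization step) looks like this:

# Sketch: compute WER/CER with jiwer (the repo normalizes text before scoring; omitted here for brevity)
import jiwer

references = ["the quick brown fox", "hello world"]
hypotheses = ["the quick brown fox", "hello word"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))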
Results
- Training includes loss, eval WER, and eval CER curves; see the assets/ directory for plots.
Summary
- Baseline WER/CER are logged at each evaluation step; users should report domain‑specific results on their own datasets.
Model Examination
- Decoding is greedy by default; beam search / LM fusion is not included in this repository. For error analysis, inspect the logits and frame‑level alignments, as sketched below.
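A rough sketch of such an inspection, reusing model, processor, inputs, and device from the Python example above (the ~20 ms per-frame stride is the standard wav2vec2-base value at 16 kHz):

# Sketch: inspect the per-frame greedy CTC path (frame duration ~20 ms for wav2vec2-base at 16 kHz)
import torch

with torch.no_grad():
    logits = model(inputs["input_values"].to(device)).logits  # (batch, frames, vocab)

frame_ids = logits.argmax(dim=-1)[0].cpu().tolist()
tokens = processor.tokenizer.convert_ids_to_tokens(frame_ids)
for i, tok in enumerate(tokens):
    if tok != processor.tokenizer.pad_token:  # the pad token doubles as the CTC blank
        print(f"~{i * 0.02:5.2f}s  {tok}")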
Environmental Impact
- Hardware Type: Laptop (Windows)
- GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
- CUDA / PyTorch: CUDA 12.9, PyTorch 2.8.0+cu129
- Hours used: ~3.1 h (approx.)
- Cloud Provider: N/A for local; AWS SageMaker utilities available for cloud training/deployment
- Compute Region: N/A (local)
- Carbon Emitted: Not calculated; estimate with the MLCO2 calculator
Technical Specifications
Model Architecture and Objective
- Architecture: Wav2Vec2 encoder with CTC output layer
- Objective: Character‑level CTC loss for ASR
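A minimal sketch of the objective in code: Wav2Vec2ForCTC computes the CTC loss internally when tokenized transcripts are passed as labels (the dummy audio and the uppercase transcript below are purely illustrative):

# Sketch: CTC loss as computed by Wav2Vec2ForCTC when labels are supplied
import torch
from transformers import AutoModelForCTC, AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = AutoModelForCTC.from_pretrained("facebook/wav2vec2-base-960h")

speech = torch.randn(16_000)  # 1 s of dummy 16 kHz audio, just to show the shapes
inputs = processor(speech.numpy(), sampling_rate=16_000, return_tensors="pt")

labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids
labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the loss (relevant for padded batches)

loss = model(inputs["input_values"], labels=labels).loss
loss.backward()  # standard backprop through the encoder and CTC head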
Compute Infrastructure
Hardware
- Local GPU as above, or AWS instance types via the SageMaker scripts (e.g., ml.g4dn.xlarge).
Software
- Python 3.10+
- Key dependencies: transformers, datasets, torch, torchaudio, soundfile, librosa, jiwer, onnxruntime (for ONNX testing), and boto3/sagemaker for deployment.
Citation
BibTeX:
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
APA: Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self‑supervised learning of speech representations. arXiv:2006.11477.
Glossary
- WER: Word Error Rate; lower is better.
- CER: Character Error Rate; lower is better.
- CTC: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.
More Information
- ONNX export: src/export_onnx.py (a rough sketch of the export idea follows this list).
- AWS SageMaker: scripts in sagemaker/ for training, deployment, and autoscaling.
- Training/metrics plots: see assets/ (e.g., train_loss.svg, eval_wer.svg, eval_cer.svg).
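The actual export logic lives in src/export_onnx.py; as a rough illustration only (opset version, tensor names, and paths below are assumptions), a plain torch.onnx.export of the fine-tuned checkpoint looks roughly like this:

# Sketch: export the CTC model to ONNX with dynamic batch/length axes (details assumed)
import torch
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("./outputs/asr").eval()
model.config.return_dict = False  # export a plain tuple of tensors instead of a ModelOutput
dummy = torch.randn(1, 16_000)    # 1 s of dummy 16 kHz audio

torch.onnx.export(
    model,
    (dummy,),
    "model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch", 1: "samples"},
        "logits": {0: "batch", 1: "frames"},
    },
    opset_version=17,
)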
Model Card Authors
- Amirhossein Yousefi (repo author)
Model Card Contact
- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR