Model Card for ASR (CTC‑based English ASR)

This repository contains an end‑to‑end Automatic Speech Recognition (ASR) pipeline built around Hugging Face Transformers. The default configuration fine‑tunes facebook/wav2vec2-base-960h with a CTC head on a 50k‑utterance subsample of Common Voice 17.0 (English) and provides scripts to train, evaluate, export to ONNX, and deploy on AWS SageMaker. It also includes a robust audio loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).

Model Details

Model Description

  • Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
  • Funded by: Not specified
  • Shared by: Amirhossein Yousefi
  • Model type: CTC-based ASR using Transformers (Wav2Vec2ForCTC)
  • Language(s): English (en)
  • License: Base model is Apache-2.0; repository/fine-tuned weights license not explicitly stated here (treat as other until clarified)
  • Finetuned from model: facebook/wav2vec2-base-960h

The training/evaluation pipeline uses Hugging Face transformers, datasets, and jiwer and includes scripts for inference and SageMaker deployment.

Model Sources

  • Repository: https://github.com/amirhossein-yousefi/ASR
  • Paper: Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
  • Demo: N/A (local CLI and SageMaker examples included)

Uses

Direct Use

  • General‑purpose English speech transcription for short to moderately long audio segments (default duration filter: ~1–18 seconds).
  • Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types).
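
For the SageMaker path, the snippet below is a minimal sketch of a JSON/base64 invocation with boto3. The endpoint name and the JSON field name ("audio_b64") are placeholders, not the repository's actual contract; check the handlers in sagemaker/ for the exact payload format.

import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

# Encode a local WAV file as base64 and wrap it in a JSON payload.
with open("path/to/file.wav", "rb") as f:
    payload = {"audio_b64": base64.b64encode(f.read()).decode("utf-8")}

response = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",        # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))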

Downstream Use

  • Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
  • Export to ONNX for CPU‑friendly inference and integration in production applications.
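
As an illustration of the ONNX path, the sketch below runs CPU inference with onnxruntime. The file name "model.onnx" is a placeholder, and the exported graph is assumed to take a single audio input and return CTC logits; the input name is read from the session rather than hard‑coded.

import numpy as np
import onnxruntime as ort
import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:  # downmix multi-channel audio to mono
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="np")
input_name = session.get_inputs()[0].name   # avoid hard-coding the exported input name
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])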

Out-of-Scope Use

  • Speaker diarization, punctuation restoration, and true streaming ASR are not included.
  • Multilingual or code‑switched speech without additional fine‑tuning.
  • Very long files without chunking; heavy background noise without augmentation/tuning.
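
For long recordings, a simple mitigation is to transcribe fixed‑size windows and join the text. The sketch below (reusing a model, processor, and device loaded as shown under "How to Get Started with the Model" below) uses a 15‑second window with no overlap; it is illustrative only, since words at chunk boundaries can be split or garbled.

import torch
import torchaudio

def transcribe_long(path, model, processor, device, chunk_s=15.0):
    wav, sr = torchaudio.load(path)
    target_sr = processor.feature_extractor.sampling_rate
    if sr != target_sr:
        wav = torchaudio.functional.resample(wav, sr, target_sr)
    audio = wav.mean(dim=0)                      # downmix to mono
    step = int(chunk_s * target_sr)
    texts = []
    for start in range(0, audio.numel(), step):
        chunk = audio[start:start + step]
        if chunk.numel() < 1600:                 # skip fragments shorter than ~0.1 s
            continue
        inputs = processor(chunk.numpy(), sampling_rate=target_sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values.to(device)).logits
        ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(ids.cpu().numpy())[0])
    return " ".join(t for t in texts if t)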

Bias, Risks, and Limitations

  • The default fine‑tuning dataset (Common Voice 17.0, English) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
  • Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.

Recommendations

  • Always evaluate WER/CER on your own hold‑out data; a minimal scoring snippet follows this list. Consider adding punctuation/casing restoration models and domain vocabularies as needed.
  • For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.
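
A minimal scoring sketch with jiwer, assuming you already have reference and hypothesis strings for your hold‑out set; apply the same text normalization to both sides that you use during training and evaluation.

import jiwer

references = ["the quick brown fox", "hello world"]   # your hold-out transcripts
hypotheses = ["the quick brown fox", "hello word"]    # model outputs for the same audio

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")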

How to Get Started with the Model

Python (local inference):

import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor

model_dir = "./outputs/asr"  # or a Hugging Face hub id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()

wav, sr = torchaudio.load("path/to/file.wav")
if wav.shape[0] > 1:  # downmix multi-channel audio to mono
    wav = wav.mean(dim=0, keepdim=True)
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:  # Wav2Vec2 expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])

CLI (example):

python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav

Training Details

Training Data

  • Dataset: Common Voice 17.0 (English), text column: sentence
  • Duration filter: min ~1.0s, max ~18.0s
  • Notes: Case‑aware normalization, whitelist filtering to match tokenizer vocabulary; optional waveform augmentations.
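
A hedged sketch of this data setup using datasets; the exact split slicing, authentication handling, and filter implementation in the repository's scripts may differ. Common Voice on the Hub is gated, so you must accept its terms and authenticate first.

from datasets import Audio, load_dataset

# Roughly a 50k-row English subsample; newer datasets versions may also require
# trust_remote_code=True and an auth token for this dataset.
ds = load_dataset("mozilla-foundation/common_voice_17_0", "en", split="train[:50000]")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

def in_duration_range(example, min_s=1.0, max_s=18.0):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return min_s <= duration <= max_s

ds = ds.filter(in_duration_range)   # the "sentence" column holds the reference text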

Training Procedure

Preprocessing

  • Robust audio decoding (FFmpeg preferred on Windows; fallback to torchaudio/soundfile/librosa), resampling to 16 kHz as required by Wav2Vec2.
  • Tokenization via the model’s processor; dynamic padding with a CTC collator.
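
The collator referenced above follows the usual Wav2Vec2/CTC pattern: pad audio and labels separately, then mask label padding with -100 so the loss ignores it. This is a minimal sketch; the repository's implementation may differ in detail.

from dataclasses import dataclass
from typing import Any, Dict, List

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        audio_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.feature_extractor.pad(
            audio_features, padding=True, return_tensors="pt"
        )
        labels_batch = self.processor.tokenizer.pad(
            label_features, padding=True, return_tensors="pt"
        )
        # Replace label padding with -100 so the CTC loss ignores those positions.
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100
        )
        return batch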

Training Hyperparameters

  • Epochs: 3
  • Per‑device batch size: 8 (× 8 grad accumulation → effective 64)
  • Learning rate: 3e‑5
  • Warmup ratio: 0.05
  • Optimizer: adamw_torch_fused
  • Weight decay: 0.0
  • Precision: FP16
  • Max grad norm: 1.0
  • Logging: every 50 steps; Eval/Save: every 500 steps; keep last 2 checkpoints; early stopping patience = 3
  • Seed: 42
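
For orientation, the settings above map roughly to the following TrainingArguments; the repository's training script is authoritative. On transformers versions before 4.41 the eval_strategy argument is spelled evaluation_strategy, and the early‑stopping patience of 3 would be added via EarlyStoppingCallback.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,      # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,
    load_best_model_at_end=True,        # needed for early stopping on eval WER
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)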

Speeds, Sizes, Times

  • Total training FLOPs: ~1.08 × 10¹⁹ (logged as 10,814,747,992,293,114,000)
  • Training runtime: ~11,168 s for 2,346 steps
  • Logs: TensorBoard at src/output/logs (or similar path as configured)

Evaluation

Testing Data, Factors & Metrics

  • Metrics: WER (primary) and CER (auxiliary), computed with jiwer utilities (see the sketch after this list).
  • Factors: English speech across CV17 splits; performance varies by accent, recording conditions, and utterance length.
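
A Trainer-style compute_metrics along these lines reproduces the WER/CER computation; the repository may apply additional text normalization before scoring.

import numpy as np
import jiwer

def make_compute_metrics(processor):
    def compute_metrics(pred):
        pred_ids = np.argmax(pred.predictions, axis=-1)
        label_ids = pred.label_ids.copy()
        # Restore padded label positions (-100) to the pad id before decoding.
        label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
        hypotheses = processor.batch_decode(pred_ids)
        references = processor.batch_decode(label_ids, group_tokens=False)
        return {"wer": jiwer.wer(references, hypotheses),
                "cer": jiwer.cer(references, hypotheses)}
    return compute_metrics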

Results

  • Training logs include loss, eval WER, and eval CER curves; see the assets/ directory for the corresponding plots.

Summary

  • Baseline WER/CER are logged at each evaluation step; users should report domain‑specific results on their own datasets.

Model Examination

  • Greedy decoding by default; beam search/LM fusion is not included in this repo. Inspect logits and alignments if needed for error analysis.
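
For quick error analysis, a sketch like the following inspects per-frame greedy tokens and their probabilities. It reuses model, processor, inputs, and device from the inference example above; for Wav2Vec2 the pad token doubles as the CTC blank.

import torch

with torch.no_grad():
    logits = model(inputs.input_values.to(device)).logits[0]   # (frames, vocab)
probs = torch.softmax(logits, dim=-1)
frame_ids = probs.argmax(dim=-1)
frame_conf = probs.max(dim=-1).values

tokens = processor.tokenizer.convert_ids_to_tokens(frame_ids.tolist())
for t, (token, conf) in enumerate(zip(tokens, frame_conf.tolist())):
    if token != processor.tokenizer.pad_token:   # skip CTC blank frames
        print(f"frame {t:4d}  token {token!r}  p={conf:.2f}")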

Environmental Impact

  • Hardware Type: Laptop (Windows)
  • GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
  • CUDA / PyTorch: CUDA 12.9, PyTorch 2.8.0+cu129
  • Hours used: ~3.1 h
  • Cloud Provider: N/A for local; AWS SageMaker utilities available for cloud training/deployment
  • Compute Region: N/A (local)
  • Carbon Emitted: Not calculated; can be estimated with the MLCO2 Machine Learning Impact calculator (https://mlco2.github.io/impact)

Technical Specifications

Model Architecture and Objective

  • Architecture: Wav2Vec2 encoder with a CTC output head
  • Parameters: ~94.4M (stored as F32 safetensors)
  • Objective: Character‑level CTC loss for ASR
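
In practice the objective is computed by the model itself: passing character-level labels to Wav2Vec2ForCTC makes the forward pass return the CTC loss. A self-contained sketch with dummy audio:

import torch
from transformers import AutoProcessor, Wav2Vec2ForCTC

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

audio = torch.randn(16_000)   # one second of dummy 16 kHz audio
inputs = processor(audio.numpy(), sampling_rate=16_000, return_tensors="pt")
labels = processor.tokenizer("HELLO WORLD", return_tensors="pt").input_ids

loss = model(inputs.input_values, labels=labels).loss   # CTC loss over character targets
print(loss.item())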

Compute Infrastructure

Hardware

  • Local GPU as above; or AWS instance types via SageMaker scripts (e.g., ml.g4dn.xlarge).

Software

  • Python 3.10+
  • Key dependencies: transformers, datasets, torch, torchaudio, soundfile, librosa, jiwer, onnxruntime (for ONNX testing), and boto3/sagemaker for deployment.

Citation

BibTeX:

@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}

APA: Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self‑supervised learning of speech representations. arXiv:2006.11477.

Glossary

  • WER: Word Error Rate; lower is better.
  • CER: Character Error Rate; lower is better.
  • CTC: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.

More Information

  • ONNX export: src/export_onnx.py (an illustrative export sketch follows this list)
  • AWS SageMaker: scripts in sagemaker/ for training, deployment, and autoscaling.
  • Training/metrics plots: see assets/ (e.g., train_loss.svg, eval_wer.svg, eval_cer.svg).
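
As an illustration of the export step (not a copy of src/export_onnx.py), a plain torch.onnx.export of the fine-tuned checkpoint with dynamic batch/time axes looks roughly like this:

import torch
from transformers import AutoModelForCTC

model = AutoModelForCTC.from_pretrained("./outputs/asr").eval()
dummy = torch.randn(1, 16_000)   # one second of dummy 16 kHz audio

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input_values"],
    output_names=["logits"],
    dynamic_axes={
        "input_values": {0: "batch", 1: "time"},
        "logits": {0: "batch", 1: "frames"},
    },
    opset_version=17,
)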

Model Card Authors

  • Amirhossein Yousefi (repo author)

Model Card Contact

  • Amirhossein Yousefi (GitHub: @amirhossein-yousefi); open an issue on the repository: https://github.com/amirhossein-yousefi/ASR