Wav2Vec2-XLS-R-1B Fine-Tuned for Urdu ASR πŸŽ™οΈπŸ‡΅πŸ‡°

This repository hosts a fine-tuned version of facebook/wav2vec2-xls-r-1b for Automatic Speech Recognition (ASR) in Urdu.
The model has been trained on the Common Voice Corpus 22.0 (Urdu subset) with extensive enhancements in preprocessing, error handling, and training monitoring.


✨ Highlights

  • Base Model: facebook/wav2vec2-xls-r-1b (1B parameters, multilingual)
  • Target Language: Urdu
  • Dataset: Mozilla Common Voice 22.0 (Urdu)
  • Training Framework: Hugging Face Transformers + Datasets
  • Metrics Logged: Training Loss, Validation Loss, WER, CER
  • Hardware: Single NVIDIA RTX 4090 (24 GB VRAM)
  • Optimizations:
    • FP16 mixed precision
    • Gradient checkpointing
    • RTX 4090–specific CUDA/TF32 tuning
    • Early stopping & loss monitoring
  • Robust Preprocessing: Custom Urdu text cleaner, enhanced audio validation, dynamic vocabulary generation
  • Comprehensive Tracking: Weights & Biases integration, CSV logging, and Markdown summary reports
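The custom Urdu text cleaner itself is not included in this card. As a rough illustration of the kind of normalization such a cleaner performs (the actual script may apply different rules for diacritics, digits, or punctuation), a minimal regex-based sketch:

```python
import re

def clean_urdu_text(text: str) -> str:
    """Illustrative cleaner: keep Urdu script and spaces, drop everything else.

    Hypothetical sketch -- the repository's actual cleaner may differ
    (diacritic handling, Arabic-Indic digit mapping, etc.).
    """
    # Urdu is written in the Arabic Unicode block (U+0600-U+06FF);
    # strip Latin characters, ASCII digits, and punctuation.
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)
    # Collapse whitespace runs left behind by removed characters.
    return re.sub(r"\s+", " ", text).strip()

print(clean_urdu_text("سلام! Hello دنیا 123."))  # → سلام دنیا
```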

πŸ—οΈ Model Architecture

  • Base: facebook/wav2vec2-xls-r-1b
  • Architecture: Wav2Vec2-CTC (Connectionist Temporal Classification)
  • Feature encoder: Frozen during fine-tuning
  • Dropouts (for regularization):
    • Attention: 0.1
    • Activation: 0.1
    • Hidden: 0.1
    • Feature projection: 0.0
    • Final: 0.0

Hyperparameters

  • Batch Size:
    • Train: 4 (gradient accumulation = 2 β†’ effective batch = 8)
    • Eval: 8
  • Learning Rate: 3e-5
  • Optimizer: AdamW with weight decay = 0.01
  • Warmup Steps: 1000
  • Max Grad Norm: 1.0
  • Epochs: 30
  • Save/Eval Steps: 1000
  • Logging Steps: 25
  • Early Stopping: patience = 5
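These hyperparameters are internally consistent with the step counts in the metrics table below; a quick sanity check (the per-epoch example count is derived here, not stated in the card):

```python
# Sanity-check the training arithmetic implied by the hyperparameters.
per_device_batch = 4
grad_accum = 2
effective_batch = per_device_batch * grad_accum   # 4 × 2 = 8

total_steps = 27480    # final step in the metrics table
epochs = 30
steps_per_epoch = total_steps / epochs            # 916 steps per epoch

# Implied number of training examples consumed per epoch (derived).
examples_per_epoch = steps_per_epoch * effective_batch
print(effective_batch, steps_per_epoch, examples_per_epoch)
```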

Metrics

  • Word Error Rate (WER)
  • Character Error Rate (CER)
  • Training/Validation Loss (with NaN/Inf safeguards)
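WER and CER are both normalized edit-distance metrics (the training pipeline uses `jiwer` for this, per the dependency list below). Their definitions can be sketched in plain Python, word-level for WER and character-level for CER:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (one-row DP)."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                           # deletion
                        dp[j - 1] + 1,                       # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))   # substitution
            prev = cur
    return dp[n]

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: word-level edits / reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(wer("the cat sat", "the cat sit"))  # 1 substitution / 3 words
```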

πŸš€ Model Performance

This model achieves strong results on Urdu speech recognition:

  • Best WER (Word Error Rate): 33.75%
  • Best CER (Character Error Rate): 27.00%
  • 44.7% relative WER reduction from the first evaluation checkpoint (61.07% → 33.75%)
  • Steady improvement across 30 training epochs
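The 44.7% figure is the relative WER reduction from the first evaluation checkpoint (step 1000, WER 0.6107) to the best checkpoint (WER 0.3375), using values from the training-metrics table below; it can be checked directly:

```python
# Relative WER improvement, computed from the logged metrics.
initial_wer = 0.6107  # step 1000
best_wer = 0.3375     # step 27000
relative_improvement = (initial_wer - best_wer) / initial_wer
print(f"{relative_improvement:.1%}")  # → 44.7%
```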

πŸ“Š Training Metrics

The model was trained for 30 epochs with a batch size optimized for the RTX 4090. Metrics were logged continuously.

| Step | Epoch | Training Loss | Validation Loss | WER | CER |
|------|-------|---------------|-----------------|-----|-----|
| 1000 | 1.09 | 3.1996 | 1.0216 | 0.6107 | 0.4886 |
| 2000 | 2.18 | 5.5422 | 0.8069 | 0.4751 | 0.3801 |
| 3000 | 3.28 | 3.8995 | 0.7641 | 0.4441 | 0.3553 |
| 4000 | 4.37 | 1.7375 | 0.714 | 0.4175 | 0.334 |
| 5000 | 5.46 | 1.8486 | 0.7205 | 0.3998 | 0.3198 |
| 6000 | 6.55 | 4.2864 | 0.6949 | 0.397 | 0.3176 |
| 7000 | 7.64 | 5.7143 | 0.7016 | 0.3783 | 0.3026 |
| 8000 | 8.73 | 3.0777 | 0.6733 | 0.3817 | 0.3053 |
| 9000 | 9.83 | 3.3163 | 0.6827 | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399 | 0.6645 | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039 | 0.7104 | 0.3684 | 0.2947 |
| 12000 | 13.1 | 2.7625 | 0.693 | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.189 | 0.7066 | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301 | 0.7281 | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099 | 0.7179 | 0.354 | 0.2832 |
| 16000 | 17.47 | 2.191 | 0.7339 | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916 | 0.7245 | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375 | 0.7599 | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273 | 0.7414 | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164 | 0.7877 | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591 | 0.7595 | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049 | 0.7994 | 0.343 | 0.2744 |
| 23000 | 25.11 | 4.7571 | 0.8182 | 0.3457 | 0.2766 |
| 24000 | 26.2 | 2.9164 | 0.8067 | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302 | 0.8132 | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031 | 0.8328 | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038 | 0.8367 | 0.3375 | 0.27 |
| 27480 | 30 | 5.8839 | 0.8261 | 0.3376 | 0.2701 |

πŸ’» Usage

1. Install Dependencies

```bash
pip install torch librosa soundfile transformers datasets jiwer
```

2. Run Inference

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model.eval()

# Wav2Vec2 expects 16 kHz mono input; librosa resamples on load.
speech, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```

πŸ“œ Citation

```bibtex
@misc{azeem2025wav2vec2urdu,
  title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
  author={Ahmed, Azeem},
  year={2025},
  howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
```

πŸ™ Acknowledgements

  • Facebook AI Research for Wav2Vec2-XLS-R
  • Mozilla for Common Voice 22.0
  • Hugging Face team
  • Weights & Biases for experiment tracking

🌟 Star this repository if you find it useful!

Built with ❀️ for the Urdu language community
