# Wav2Vec2-XLS-R-1B Fine-Tuned for Urdu ASR 🎙️🇵🇰
This repository hosts a fine-tuned version of `facebook/wav2vec2-xls-r-1b` for Automatic Speech Recognition (ASR) in Urdu. The model was trained on the Urdu subset of Common Voice Corpus 22.0, with enhancements in preprocessing, error handling, and training monitoring.
## ✨ Highlights

- Base Model: `facebook/wav2vec2-xls-r-1b` (1B parameters, multilingual)
- Target Language: Urdu
- Dataset: Mozilla Common Voice 22.0 (Urdu)
- Training Framework: Hugging Face Transformers + Datasets
- Metrics Logged: Training Loss, Validation Loss, WER, CER
- Hardware: Single NVIDIA RTX 4090 (24 GB VRAM)
- Optimizations:
  - FP16 mixed precision
  - Gradient checkpointing
  - RTX 4090-specific CUDA/TF32 tuning
  - Early stopping & loss monitoring
- Robust Preprocessing: Custom Urdu text cleaner, enhanced audio validation, dynamic vocabulary generation (see the sketch after this list)
- Comprehensive Tracking: Weights & Biases integration, CSV logging, and Markdown summary reports
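The preprocessing code itself is not reproduced in this card. As a rough sketch, a custom Urdu text cleaner and dynamic vocabulary builder for CTC training might look like the following; the Unicode range and the helper names (`clean_urdu`, `build_vocab`) are illustrative assumptions, not the repository's actual code.

```python
import re

def clean_urdu(text: str) -> str:
    """Hypothetical cleaner: keep Urdu-script characters and spaces, drop the rest."""
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)  # Arabic/Urdu Unicode block only
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace

def build_vocab(sentences) -> dict:
    """Derive the CTC character vocabulary dynamically from the cleaned corpus."""
    chars = sorted(set("".join(clean_urdu(s) for s in sentences)))
    vocab = {c: i for i, c in enumerate(chars)}
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")  # CTC convention: '|' marks word boundaries
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```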
## 🏗️ Model Architecture

- Base: `facebook/wav2vec2-xls-r-1b`
- Architecture: Wav2Vec2-CTC (Connectionist Temporal Classification)
- Feature encoder: Frozen during fine-tuning
- Dropouts (for regularization; see the loading sketch after this list):
  - Attention: 0.1
  - Activation: 0.1
  - Hidden: 0.1
  - Feature projection: 0.0
  - Final: 0.0
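As a rough illustration, this configuration corresponds to loading the base checkpoint with the matching `Wav2Vec2ForCTC` dropout arguments and then freezing the feature encoder. This is a sketch of the setup, not the repository's training script.

```python
from transformers import Wav2Vec2ForCTC

# Dropout values match the list above; vocab_size and pad_token_id would
# normally come from the tokenizer built during preprocessing.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",
    attention_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    final_dropout=0.0,
)

# Freeze the convolutional feature encoder, as described above
model.freeze_feature_encoder()
```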
### Hyperparameters

- Batch Size:
  - Train: 4 (gradient accumulation = 2 → effective batch = 8)
  - Eval: 8
- Learning Rate: 3e-5
- Optimizer: AdamW with weight decay = 0.01
- Warmup Steps: 1000
- Max Grad Norm: 1.0
- Epochs: 30
- Save/Eval Steps: 1000
- Logging Steps: 25
- Early Stopping: patience = 5 (see the sketch after this list)
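A hedged reconstruction of how these hyperparameters map onto Hugging Face `TrainingArguments`, together with the TF32 tuning from the highlights. The `output_dir`, `load_best_model_at_end`, and `metric_for_best_model` values are assumptions needed to make early stopping work, not settings confirmed by this card.

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

# RTX 4090 / Ampere+ TF32 tuning mentioned in the highlights
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-1b-urdu",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,          # effective batch = 8
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,                      # AdamW is the Trainer default
    warmup_steps=1000,
    max_grad_norm=1.0,
    num_train_epochs=30,
    eval_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_steps=25,
    fp16=True,                              # mixed precision
    gradient_checkpointing=True,
    load_best_model_at_end=True,            # assumption: required for early stopping
    metric_for_best_model="wer",            # assumption
    greater_is_better=False,
)

# Early stopping with patience = 5, passed to the Trainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```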
### Metrics

- Word Error Rate (WER)
- Character Error Rate (CER)
- Training/Validation Loss (with NaN/Inf safeguards)
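WER and CER can be scored with `jiwer` (already in the dependency list) inside a Trainer-style `compute_metrics` function. A minimal sketch, assuming a `processor` loaded as in the usage section below; the NaN/Inf safeguards of the actual training script are not reproduced here.

```python
import numpy as np
import jiwer

def compute_metrics(pred):
    """Decode CTC predictions and score them against the reference transcripts."""
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # Replace -100 (positions ignored by the loss) with the pad token before decoding
    label_ids = np.where(
        pred.label_ids == -100, processor.tokenizer.pad_token_id, pred.label_ids
    )

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(label_ids, group_tokens=False)

    return {
        "wer": jiwer.wer(label_str, pred_str),
        "cer": jiwer.cer(label_str, pred_str),
    }
```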
## 📊 Model Performance

The model achieves the following results on Urdu speech recognition:

- Best WER (Word Error Rate): 33.75%
- Best CER (Character Error Rate): 27.00%
- 44.7% relative WER improvement over training (from 61.07% at the first evaluation to 33.75% at the best checkpoint)
- Steady improvement across 30 training epochs
## 📈 Training Metrics

The model was trained for 30 epochs with batch sizes tuned for the RTX 4090. Loss, WER, and CER were logged throughout training and evaluated every 1000 steps.
| Step | Epoch | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 1000 | 1.09 | 3.1996 | 1.0216 | 0.6107 | 0.4886 |
| 2000 | 2.18 | 5.5422 | 0.8069 | 0.4751 | 0.3801 |
| 3000 | 3.28 | 3.8995 | 0.7641 | 0.4441 | 0.3553 |
| 4000 | 4.37 | 1.7375 | 0.7140 | 0.4175 | 0.3340 |
| 5000 | 5.46 | 1.8486 | 0.7205 | 0.3998 | 0.3198 |
| 6000 | 6.55 | 4.2864 | 0.6949 | 0.3970 | 0.3176 |
| 7000 | 7.64 | 5.7143 | 0.7016 | 0.3783 | 0.3026 |
| 8000 | 8.73 | 3.0777 | 0.6733 | 0.3817 | 0.3053 |
| 9000 | 9.83 | 3.3163 | 0.6827 | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399 | 0.6645 | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039 | 0.7104 | 0.3684 | 0.2947 |
| 12000 | 13.10 | 2.7625 | 0.6930 | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.1890 | 0.7066 | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301 | 0.7281 | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099 | 0.7179 | 0.3540 | 0.2832 |
| 16000 | 17.47 | 2.1910 | 0.7339 | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916 | 0.7245 | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375 | 0.7599 | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273 | 0.7414 | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164 | 0.7877 | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591 | 0.7595 | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049 | 0.7994 | 0.3430 | 0.2744 |
| 23000 | 25.11 | 4.7571 | 0.8182 | 0.3457 | 0.2766 |
| 24000 | 26.20 | 2.9164 | 0.8067 | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302 | 0.8132 | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031 | 0.8328 | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038 | 0.8367 | 0.3375 | 0.2700 |
| 27480 | 30.00 | 5.8839 | 0.8261 | 0.3376 | 0.2701 |
## 💻 Usage

### 1. Install Dependencies

```bash
pip install torch librosa soundfile transformers datasets jiwer
```

### 2. Transcribe an Audio File

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")

# Load the audio and resample to 16 kHz, the rate the model expects
speech, sr = librosa.load("sample.wav", sr=16000)

inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```
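For quick experiments, the high-level `pipeline` API wraps the same load/resample/decode steps (it relies on ffmpeg to decode audio files, so ffmpeg must be installed). A minimal sketch:

```python
from transformers import pipeline

# One-line equivalent of the snippet above
asr = pipeline("automatic-speech-recognition", model="azeem-ahmed/wav2vec2-xls-r-1b-urdu")
print(asr("sample.wav")["text"])
```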
## 📖 Citation

```bibtex
@misc{azeem2025wav2vec2urdu,
  title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
  author={Ahmed, Azeem},
  year={2025},
  howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
```
## 🙏 Acknowledgements

- Facebook AI Research for Wav2Vec2-XLS-R
- Mozilla for Common Voice 22.0
- Hugging Face team
- Weights & Biases for experiment tracking
🌟 Star this repository if you find it useful!

Built with ❤️ for the Urdu language community.