# Wav2Vec2-XLS-R-1B Fine-Tuned for Urdu ASR 🎙️🇵🇰
This repository hosts a fine-tuned version of `facebook/wav2vec2-xls-r-1b` for Automatic Speech Recognition (ASR) in Urdu. The model was trained on the Urdu subset of Common Voice Corpus 22.0, with enhancements in preprocessing, error handling, and training monitoring.
## ✨ Highlights

- Base Model: `facebook/wav2vec2-xls-r-1b` (1B parameters, multilingual)
- Target Language: Urdu
- Dataset: Mozilla Common Voice 22.0 (Urdu)
- Training Framework: Hugging Face Transformers + Datasets
- Metrics Logged: Training Loss, Validation Loss, WER, CER
- Hardware: Single NVIDIA RTX 4090 (24 GB VRAM)
- Optimizations:
  - FP16 mixed precision
  - Gradient checkpointing
  - RTX 4090-specific CUDA/TF32 tuning
  - Early stopping & loss monitoring
- Robust Preprocessing: Custom Urdu text cleaner, enhanced audio validation, dynamic vocabulary generation (see the sketch after this list)
- Comprehensive Tracking: Weights & Biases integration, CSV logging, and Markdown summary reports
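The preprocessing code itself is not reproduced in this card. As a rough sketch, a custom Urdu text cleaner and dynamic vocabulary builder for CTC training might look like the following; the Unicode range and the helper names (`clean_urdu`, `build_vocab`) are illustrative assumptions, not the repository's actual code.

```python
import re

def clean_urdu(text: str) -> str:
    """Hypothetical cleaner: keep Urdu-script characters and spaces, drop the rest."""
    text = re.sub(r"[^\u0600-\u06FF\s]", "", text)  # Arabic/Urdu Unicode block only
    return re.sub(r"\s+", " ", text).strip()        # collapse repeated whitespace

def build_vocab(sentences) -> dict:
    """Derive the CTC character vocabulary dynamically from the cleaned corpus."""
    chars = sorted(set("".join(clean_urdu(s) for s in sentences)))
    vocab = {c: i for i, c in enumerate(chars)}
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")  # CTC convention: '|' marks word boundaries
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab
```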
## 🏗️ Model Architecture

- Base: `facebook/wav2vec2-xls-r-1b`
- Architecture: Wav2Vec2-CTC (Connectionist Temporal Classification)
- Feature encoder: Frozen during fine-tuning
- Dropouts (for regularization; see the loading sketch after this list):
  - Attention: 0.1
  - Activation: 0.1
  - Hidden: 0.1
  - Feature projection: 0.0
  - Final: 0.0
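As a rough illustration, this configuration corresponds to loading the base checkpoint with the matching `Wav2Vec2ForCTC` dropout arguments and then freezing the feature encoder. This is a sketch of the setup, not the repository's training script.

```python
from transformers import Wav2Vec2ForCTC

# Dropout values match the list above; vocab_size and pad_token_id would
# normally come from the tokenizer built during preprocessing.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",
    attention_dropout=0.1,
    activation_dropout=0.1,
    hidden_dropout=0.1,
    feat_proj_dropout=0.0,
    final_dropout=0.0,
)

# Freeze the convolutional feature encoder, as described above
model.freeze_feature_encoder()
```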
### Hyperparameters

- Batch Size:
  - Train: 4 (gradient accumulation = 2 → effective batch = 8)
  - Eval: 8
- Learning Rate: 3e-5
- Optimizer: AdamW with weight decay = 0.01
- Warmup Steps: 1000
- Max Grad Norm: 1.0
- Epochs: 30
- Save/Eval Steps: 1000
- Logging Steps: 25
- Early Stopping: patience = 5 (see the sketch after this list)
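A hedged reconstruction of how these hyperparameters map onto Hugging Face `TrainingArguments`, together with the TF32 tuning from the highlights. The `output_dir`, `load_best_model_at_end`, and `metric_for_best_model` values are assumptions needed to make early stopping work, not settings confirmed by this card.

```python
import torch
from transformers import TrainingArguments, EarlyStoppingCallback

# RTX 4090 / Ampere+ TF32 tuning mentioned in the highlights
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

training_args = TrainingArguments(
    output_dir="./wav2vec2-xls-r-1b-urdu",  # hypothetical path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,          # effective batch = 8
    per_device_eval_batch_size=8,
    learning_rate=3e-5,
    weight_decay=0.01,                      # AdamW is the Trainer default
    warmup_steps=1000,
    max_grad_norm=1.0,
    num_train_epochs=30,
    eval_strategy="steps",
    eval_steps=1000,
    save_steps=1000,
    logging_steps=25,
    fp16=True,                              # mixed precision
    gradient_checkpointing=True,
    load_best_model_at_end=True,            # assumption: required for early stopping
    metric_for_best_model="wer",            # assumption
    greater_is_better=False,
)

# Early stopping with patience = 5, passed to the Trainer via callbacks=[...]
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)
```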
### Metrics

- Word Error Rate (WER)
- Character Error Rate (CER)
- Training/Validation Loss (with NaN/Inf safeguards)
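WER and CER can be scored with `jiwer` (already in the dependency list) inside a Trainer-style `compute_metrics` function. A minimal sketch, assuming a `processor` loaded as in the usage section below; the NaN/Inf safeguards of the actual training script are not reproduced here.

```python
import numpy as np
import jiwer

def compute_metrics(pred):
    """Decode CTC predictions and score them against the reference transcripts."""
    pred_ids = np.argmax(pred.predictions, axis=-1)
    # Replace -100 (positions ignored by the loss) with the pad token before decoding
    label_ids = np.where(
        pred.label_ids == -100, processor.tokenizer.pad_token_id, pred.label_ids
    )

    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(label_ids, group_tokens=False)

    return {
        "wer": jiwer.wer(label_str, pred_str),
        "cer": jiwer.cer(label_str, pred_str),
    }
```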
## 📊 Model Performance

The model achieves the following results on Urdu speech recognition:

- Best WER (Word Error Rate): 33.75%
- Best CER (Character Error Rate): 27.00%
- 44.7% relative WER improvement over training (from 61.07% at the first evaluation to 33.75% at the best checkpoint)
- Steady improvement across 30 training epochs
## 📈 Training Metrics

The model was trained for 30 epochs with batch sizes tuned for the RTX 4090. Loss, WER, and CER were logged throughout training and evaluated every 1000 steps.
| Step | Epoch | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 1000 | 1.09 | 3.1996 | 1.0216 | 0.6107 | 0.4886 |
| 2000 | 2.18 | 5.5422 | 0.8069 | 0.4751 | 0.3801 |
| 3000 | 3.28 | 3.8995 | 0.7641 | 0.4441 | 0.3553 |
| 4000 | 4.37 | 1.7375 | 0.7140 | 0.4175 | 0.3340 |
| 5000 | 5.46 | 1.8486 | 0.7205 | 0.3998 | 0.3198 |
| 6000 | 6.55 | 4.2864 | 0.6949 | 0.3970 | 0.3176 |
| 7000 | 7.64 | 5.7143 | 0.7016 | 0.3783 | 0.3026 |
| 8000 | 8.73 | 3.0777 | 0.6733 | 0.3817 | 0.3053 |
| 9000 | 9.83 | 3.3163 | 0.6827 | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399 | 0.6645 | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039 | 0.7104 | 0.3684 | 0.2947 |
| 12000 | 13.10 | 2.7625 | 0.6930 | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.1890 | 0.7066 | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301 | 0.7281 | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099 | 0.7179 | 0.3540 | 0.2832 |
| 16000 | 17.47 | 2.1910 | 0.7339 | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916 | 0.7245 | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375 | 0.7599 | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273 | 0.7414 | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164 | 0.7877 | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591 | 0.7595 | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049 | 0.7994 | 0.3430 | 0.2744 |
| 23000 | 25.11 | 4.7571 | 0.8182 | 0.3457 | 0.2766 |
| 24000 | 26.20 | 2.9164 | 0.8067 | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302 | 0.8132 | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031 | 0.8328 | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038 | 0.8367 | 0.3375 | 0.2700 |
| 27480 | 30.00 | 5.8839 | 0.8261 | 0.3376 | 0.2701 |
## 💻 Usage

### 1. Install Dependencies

```bash
pip install torch librosa soundfile transformers datasets jiwer
```

### 2. Transcribe an Audio File

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")

# Load the audio and resample to 16 kHz, the rate the model expects
speech, sr = librosa.load("sample.wav", sr=16000)

inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```
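For quick experiments, the high-level `pipeline` API wraps the same load/resample/decode steps (it relies on ffmpeg to decode audio files, so ffmpeg must be installed). A minimal sketch:

```python
from transformers import pipeline

# One-line equivalent of the snippet above
asr = pipeline("automatic-speech-recognition", model="azeem-ahmed/wav2vec2-xls-r-1b-urdu")
print(asr("sample.wav")["text"])
```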
## 📖 Citation

```bibtex
@misc{azeem2025wav2vec2urdu,
  title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
  author={Ahmed, Azeem},
  year={2025},
  howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
```
## 🙏 Acknowledgements

- Facebook AI Research for Wav2Vec2-XLS-R
- Mozilla for Common Voice 22.0
- Hugging Face team
- Weights & Biases for experiment tracking
🌟 Star this repository if you find it useful!

Built with ❤️ for the Urdu language community.