---
license: mit
language:
- ur
metrics:
- wer
- cer
base_model:
- facebook/wav2vec2-xls-r-1b
pipeline_tag: automatic-speech-recognition
tags:
- asr
- urdu
- pytorch
- STT
- Transformers
- transcription
datasets:
- azeem-ahmed/Common_Voice_Corpus_22_0_Urdu
library_name: transformers
---

# Wav2Vec2-XLS-R-1B Fine-Tuned for Urdu ASR πŸŽ™οΈπŸ‡΅πŸ‡°

This repository hosts a fine-tuned version of **[facebook/wav2vec2-xls-r-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b)** for **Automatic Speech Recognition (ASR) in Urdu**. The model was trained on the **Common Voice Corpus 22.0 (Urdu subset)** with enhancements to preprocessing, error handling, and training monitoring.

---

## ✨ Highlights

- **Base Model**: facebook/wav2vec2-xls-r-1b (1B parameters, multilingual)
- **Target Language**: Urdu
- **Dataset**: [Mozilla Common Voice 22.0 (Urdu)](https://commonvoice.mozilla.org/en/datasets)
- **Training Framework**: Hugging Face Transformers + Datasets
- **Metrics Logged**: training loss, validation loss, WER, CER
- **Hardware**: single NVIDIA RTX 4090 (24 GB VRAM)
- **Optimizations**:
  - FP16 mixed precision
  - Gradient checkpointing
  - RTX 4090–specific CUDA/TF32 tuning
  - Early stopping & loss monitoring
- **Robust Preprocessing**: custom Urdu text cleaner, enhanced audio validation, dynamic vocabulary generation
- **Comprehensive Tracking**: Weights & Biases integration, CSV logging, and Markdown summary reports

---

### πŸ—οΈ Model Architecture

- Base: `facebook/wav2vec2-xls-r-1b`
- Architecture: Wav2Vec2 with a CTC (Connectionist Temporal Classification) head
- Feature encoder: frozen during fine-tuning
- Dropouts (for regularization):
  - Attention: 0.1
  - Activation: 0.1
  - Hidden: 0.1
  - Feature projection: 0.0
  - Final: 0.0

**Hyperparameters**

- Batch size:
  - Train: 4 (gradient accumulation = 2 β†’ effective batch = 8)
  - Eval: 8
- Learning rate: 3e-5
- Optimizer: AdamW with weight decay = 0.01
- Warmup steps: 1000
- Max grad norm: 1.0
- Epochs: 30
- Save/eval steps: 1000
- Logging steps: 25
- Early stopping: patience = 5

**Metrics**

- Word Error Rate (WER)
- Character Error Rate (CER)
- Training/validation loss (with NaN/Inf safeguards)

---

## πŸš€ Model Performance

- **Best WER (Word Error Rate): 33.75%**
- **Best CER (Character Error Rate): 27.00%**
- **44.7% relative WER reduction** from the first evaluation checkpoint (61.07% β†’ 33.75%)
- Steady improvement across 30 training epochs

## πŸ“Š Training Metrics

The model was trained for **30 epochs** with a batch size tuned for the RTX 4090; metrics were logged continuously.

| Step  | Epoch | Training Loss | Validation Loss | WER    | CER    |
|-------|-------|---------------|-----------------|--------|--------|
| 1000  | 1.09  | 3.1996        | 1.0216          | 0.6107 | 0.4886 |
| 2000  | 2.18  | 5.5422        | 0.8069          | 0.4751 | 0.3801 |
| 3000  | 3.28  | 3.8995        | 0.7641          | 0.4441 | 0.3553 |
| 4000  | 4.37  | 1.7375        | 0.7140          | 0.4175 | 0.3340 |
| 5000  | 5.46  | 1.8486        | 0.7205          | 0.3998 | 0.3198 |
| 6000  | 6.55  | 4.2864        | 0.6949          | 0.3970 | 0.3176 |
| 7000  | 7.64  | 5.7143        | 0.7016          | 0.3783 | 0.3026 |
| 8000  | 8.73  | 3.0777        | 0.6733          | 0.3817 | 0.3053 |
| 9000  | 9.83  | 3.3163        | 0.6827          | 0.3646 | 0.2916 |
| 10000 | 10.92 | 2.6399        | 0.6645          | 0.3647 | 0.2918 |
| 11000 | 12.01 | 1.9039        | 0.7104          | 0.3684 | 0.2947 |
| 12000 | 13.10 | 2.7625        | 0.6930          | 0.3624 | 0.2899 |
| 13000 | 14.19 | 4.1890        | 0.7066          | 0.3621 | 0.2897 |
| 14000 | 15.28 | 4.8301        | 0.7281          | 0.3565 | 0.2852 |
| 15000 | 16.38 | 2.8099        | 0.7179          | 0.3540 | 0.2832 |
| 16000 | 17.47 | 2.1910        | 0.7339          | 0.3527 | 0.2821 |
| 17000 | 18.56 | 6.7916        | 0.7245          | 0.3589 | 0.2871 |
| 18000 | 19.65 | 4.7375        | 0.7599          | 0.3485 | 0.2788 |
| 19000 | 20.74 | 6.2273        | 0.7414          | 0.3471 | 0.2776 |
| 20000 | 21.83 | 2.4164        | 0.7877          | 0.3519 | 0.2815 |
| 21000 | 22.93 | 3.9591        | 0.7595          | 0.3422 | 0.2737 |
| 22000 | 24.02 | 7.3049        | 0.7994          | 0.3430 | 0.2744 |
| 23000 | 25.11 | 4.7571        | 0.8182          | 0.3457 | 0.2766 |
| 24000 | 26.20 | 2.9164        | 0.8067          | 0.3417 | 0.2733 |
| 25000 | 27.29 | 4.1302        | 0.8132          | 0.3377 | 0.2701 |
| 26000 | 28.38 | 4.2031        | 0.8328          | 0.3383 | 0.2707 |
| 27000 | 29.48 | 1.2038        | 0.8367          | 0.3375 | 0.2700 |
| 27480 | 30.00 | 5.8839        | 0.8261          | 0.3376 | 0.2701 |
## πŸ’» Usage

### 1. Install Dependencies

```bash
pip install torch librosa soundfile transformers datasets jiwer
```

### 2. Run Inference

The model expects 16 kHz mono audio; loading with `librosa.load(..., sr=16_000)` resamples automatically if the file uses a different rate.

```python
import torch
import librosa
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model = Wav2Vec2ForCTC.from_pretrained("azeem-ahmed/wav2vec2-xls-r-1b-urdu")
model.eval()

# Load audio and resample to the 16 kHz rate the model was trained on.
speech, sr = librosa.load("sample.wav", sr=16_000)

inputs = processor(speech, sampling_rate=sr, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```

## πŸ“œ Citation

```
@misc{azeem2025wav2vec2urdu,
  title={Fine-tuned Wav2Vec2-XLS-R-1B for Urdu ASR},
  author={Ahmed, Azeem},
  year={2025},
  howpublished={\url{https://huggingface.co/azeem-ahmed/wav2vec2-xls-r-1b-urdu}},
}
```

## πŸ™ Acknowledgements

- Facebook AI Research for Wav2Vec2-XLS-R
- Mozilla for Common Voice 22.0
- The Hugging Face team
- Weights & Biases for experiment tracking

##### 🌟 Star this repository if you find it useful!

_Built with ❀️ for the Urdu language community_