---
datasets:
- mozilla-foundation/common_voice_17_0
- openslr/openslr
language:
- bn
metrics:
- wer
- cer
base_model:
- facebook/w2v-bert-2.0
pipeline_tag: automatic-speech-recognition
library_name: transformers
tags:
- asr
- bangla
- bangla-asr
- wav2vec-bert
- wav2vec-bert-bangla
license: cc-by-sa-4.0
---

# Model Card for Shrutimala Bangla ASR

## Model Details

### Model Description

This model is a fine-tuned version of `facebook/w2v-bert-2.0` for automatic speech recognition (ASR) in Bangla. It was trained on a large Bangla corpus, sourced primarily from Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR, and achieves a Word Error Rate (WER) of about 11%.

- **Developed by:** Sazzadul Islam
- **Model type:** Wav2Vec-BERT-based Bangla ASR model
- **Language(s):** Bangla (bn)
- **License:** CC-BY-SA-4.0
- **Fine-tuned from:** `facebook/w2v-bert-2.0`

## Uses

### Direct Use

This model can be used for automatic speech recognition (ASR) in Bangla, with applications in transcription, voice assistants, and accessibility tools.

### Downstream Use

It can be further fine-tuned for domain-specific ASR tasks, such as medical or legal transcription in Bangla.

### Out-of-Scope Use

- Not suitable for real-time ASR on low-power devices without optimization.
- May not perform well in noisy environments or on heavily accented regional dialects outside the training data.

## Bias, Risks, and Limitations

- The model may struggle with low-resource dialects and uncommon speech patterns.
- Biases may exist due to dataset imbalances in gender, age, or socio-economic background.
- Ethical considerations apply when using the model for surveillance or other sensitive applications.

## How to Get Started with the Model

Use the following snippet to load the model and transcribe an audio file. Note that `w2v-bert-2.0` checkpoints are loaded with `Wav2Vec2BertForCTC` (not `Wav2Vec2ForCTC`), and the model expects 16 kHz mono audio:

```python
from transformers import AutoProcessor, Wav2Vec2BertForCTC
import librosa
import torch

processor = AutoProcessor.from_pretrained("your_model_id")
model = Wav2Vec2BertForCTC.from_pretrained("your_model_id")

# Load an audio file and resample it to the 16 kHz rate the model expects
audio_input, _ = librosa.load("path/to/audio.wav", sr=16000)
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")

# Perform ASR with greedy CTC decoding
with torch.no_grad():
    logits = model(**inputs).logits

predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
```

## Training Details

### Training Data

The model was trained on the Bangla subsets of Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR.

### Training Procedure

#### Preprocessing

- Audio was resampled through a 16 kHz → 8 kHz → 16 kHz chain.
- Transcripts were normalized to improve ASR performance.

#### Training Hyperparameters

- **Batch Size:** 16
- **Learning Rate:** 1e-5
- **Training Steps:** 25,000
- **Mixed Precision:** FP16

#### Training Time and Compute

- **Hardware Used:** RTX 4090
- **Training Time:** 37 hours
- **Dataset Size:** 143k samples

## Evaluation

### Testing Data & Metrics

#### Metrics

- **WER:** 11.26%
- **CER:** 2.39%
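The reported scores can be computed with Hugging Face's `evaluate` library, whose `wer` and `cer` metrics wrap the `jiwer` package; exact numbers will depend on applying the same text normalization used during training. A minimal sketch with placeholder sentences (not drawn from the actual evaluation set):

```python
import evaluate

# Load the word error rate and character error rate metrics
wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder pairs: in practice, predictions come from the model and
# references are the ground-truth transcripts of the test set
predictions = ["আমি ভাত খাই", "সে স্কুলে যায়"]
references = ["আমি ভাত খাই", "সে স্কুলে যাই"]

wer = wer_metric.compute(predictions=predictions, references=references)
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```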
#### Factors

The model was evaluated on:

- Standard Bangla speech
- Various speaker demographics

### Results

- Performs well on clear, standard Bangla speech.
- Struggles with strong regional accents and noisy environments.

## Technical Specifications

### Model Architecture

The model is based on `facebook/w2v-bert-2.0`, a hybrid Wav2Vec2-BERT model for ASR.

### Citation

This model is based on the research presented in the following paper. If you use this model, please cite the original authors:

```
@misc{ridoy2025adaptabilityasrmodelslowresource,
      title={Adaptability of ASR Models on Low-Resource Language: A Comparative Study of Whisper and Wav2Vec-BERT on Bangla},
      author={Md Sazzadul Islam Ridoy and Sumi Akter and Md. Aminur Rahman},
      year={2025},
      eprint={2507.01931},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.01931},
}
```

## Contact

For any issues or inquiries, please contact isazzadul23@gmail.com.