Model Card for Shrutimala Bangla ASR

Model Details

Model Description

This model is a fine-tuned version of facebook/w2v-bert-2.0 for automatic speech recognition (ASR) in Bangla. It was trained on a large Bangla corpus, primarily sourced from Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR, and achieves a Word Error Rate (WER) of approximately 11%.

  • Developed by: Sazzadul Islam
  • Model type: Wav2Vec-BERT-based Bangla ASR model
  • Language(s): Bangla (bn)
  • License: CC-BY-SA-4.0
  • Fine-tuned from: facebook/w2v-bert-2.0

Uses

Direct Use

This model can be used for automatic speech recognition (ASR) in Bangla and English, with applications in transcription, voice assistants, and accessibility tools.

Downstream Use

It can be further fine-tuned for domain-specific ASR tasks, including medical or legal transcription in Bangla.

Out-of-Scope Use

  • Not suitable for real-time ASR on low-power devices without optimization.
  • May not perform well in noisy environments or on highly accented regional dialects outside the training data.

Bias, Risks, and Limitations

  • The model may struggle with low-resource dialects and uncommon speech patterns.
  • Biases may exist due to dataset imbalances in gender, age, or socio-economic backgrounds.
  • Ethical considerations should be taken into account when using the model for surveillance or other sensitive applications.

How to Get Started with the Model

Use the following code snippet to load the model and run transcription:

from transformers import AutoProcessor, Wav2Vec2BertForCTC
import torch

processor = AutoProcessor.from_pretrained("sazzadul/Shrutimala_Bangla_ASR")
model = Wav2Vec2BertForCTC.from_pretrained("sazzadul/Shrutimala_Bangla_ASR")

# Load and process an audio file (expects a 16 kHz mono waveform)
audio_input = ...  # Provide a 1-D array or tensor of raw audio samples
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")

# Perform ASR with greedy CTC decoding
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
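
Alternatively, the Transformers pipeline API wraps the same pre- and post-processing in a single call. This is a minimal sketch, assuming the checkpoint ships its processor and tokenizer configuration; the audio path is a placeholder.

from transformers import pipeline

# High-level ASR pipeline; handles feature extraction and CTC decoding internally
asr = pipeline("automatic-speech-recognition", model="sazzadul/Shrutimala_Bangla_ASR")

# Accepts a path to an audio file or a raw 16 kHz waveform array
result = asr("path/to/bangla_audio.wav")  # placeholder path
print(result["text"])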

Training Details

Training Data

The model was trained on the Bangla portions of the Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR datasets.

Training Procedure

Preprocessing

  • Audio was resampled through a 16 kHz → 8 kHz → 16 kHz chain.
  • Transcripts were normalized to improve ASR performance (a sketch of both steps follows).
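
The exact preprocessing code is not published with this card. The following is a minimal sketch of the two steps, assuming torchaudio for resampling and a simple punctuation/whitespace cleanup for transcripts; the actual normalization rules used during training may differ.

import re
import torch
import torchaudio

def resample_16_8_16(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    # Resample to 16 kHz, down to 8 kHz, then back up to 16 kHz
    wav = torchaudio.functional.resample(waveform, orig_sr, 16000)
    wav = torchaudio.functional.resample(wav, 16000, 8000)
    return torchaudio.functional.resample(wav, 8000, 16000)

def normalize_transcript(text: str) -> str:
    # Hypothetical normalization: strip common punctuation (including the
    # Bangla danda "।") and collapse whitespace; the real rules are not documented.
    punct = "\"'।,?!;:()-"
    text = re.sub(f"[{re.escape(punct)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()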

Training Hyperparameters

  • Batch Size: 16
  • Learning Rate: 1e-5
  • Training Steps: 25000
  • Mixed Precision: FP16 (see the configuration sketch below)
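
The training script itself is not included in the card. This is a minimal sketch of how the reported hyperparameters would map onto Hugging Face TrainingArguments, assuming the standard Trainer was used; output_dir is an illustrative assumption.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="shrutimala-bangla-asr",  # hypothetical output path
    per_device_train_batch_size=16,      # Batch Size: 16
    learning_rate=1e-5,                  # Learning Rate: 1e-5
    max_steps=25_000,                    # Training Steps: 25000
    fp16=True,                           # Mixed Precision: FP16
)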

Training Time and Compute

  • Hardware Used: RTX 4090
  • Training Time: 37 Hours
  • Dataset Size: 143k

Evaluation

Testing Data & Metrics

Metrics

  • WER: 11.26%
  • CER: 2.39% (see the evaluation sketch below)
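
The exact evaluation script is not provided here. The snippet below is a minimal sketch of how WER and CER can be computed with the evaluate library; the transcripts are placeholders standing in for the held-out test split.

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcripts; in practice these come from the test set
predictions = ["..."]  # model outputs
references = ["..."]   # ground-truth transcripts

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))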

Factors

The model was evaluated on:

  • Standard Bangla speech
  • Various speaker demographics

Results

  • Performs well on clear, standard Bangla speech.
  • Struggles with strong regional accents and noisy environments.

Technical Specifications

Model Architecture

The model is based on facebook/w2v-bert-2.0, a Wav2Vec2-BERT speech encoder, fine-tuned with a CTC head for Bangla ASR. The fine-tuned checkpoint has roughly 606M parameters, stored as F32 safetensors.

Contact

For any issues or inquiries, please contact [email protected].
