Model Card for Shrutimala Bangla ASR
Model Details
Model Description
This model is a fine-tuned version of facebook/w2v-bert-2.0 for automatic speech recognition (ASR) in Bangla. It was trained on Bangla speech drawn primarily from Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR, and achieves a Word Error Rate (WER) of 11.26%.
- Developed by: Sazzadul Islam
- Model type: Wav2Vec-BERT-based Bangla ASR model
- Language(s): Bangla (bn)
- License: CC-BY-SA-4.0
- Fine-tuned from: facebook/w2v-bert-2.0
Uses
Direct Use
This model can be used for automatic speech recognition (ASR) in Bangla, with applications in transcription, voice assistants, and accessibility tools.
Downstream Use
It can be further fine-tuned for domain-specific ASR tasks, including medical or legal transcription in Bangla.
Out-of-Scope Use
- Not suitable for real-time ASR on low-power devices without optimization.
- May not perform well in noisy environments or on heavily accented regional dialects that are absent from the training data.
Bias, Risks, and Limitations
- The model may struggle with low-resource dialects and uncommon speech patterns.
- Biases may exist due to dataset imbalances in gender, age, or socio-economic backgrounds.
- Ethical implications should be weighed carefully before using the model for surveillance or other sensitive applications.
How to Get Started with the Model
Use the following code snippet to load the model:
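from transformers import Wav2Vec2BertProcessor, Wav2Vec2BertForCTC
import torch

# Assumption: because the base model is facebook/w2v-bert-2.0, the checkpoint
# is loaded with the Wav2Vec2-BERT classes rather than the Wav2Vec2 ones.
processor = Wav2Vec2BertProcessor.from_pretrained("your_model_id")
model = Wav2Vec2BertForCTC.from_pretrained("your_model_id")
model.eval()

# Load and process audio file
audio_input = ...  # Provide a 1-D float waveform sampled at 16 kHz
inputs = processor(audio_input, return_tensors="pt", sampling_rate=16000)

# Perform ASR: take the most likely token per frame, then CTC-decode to text
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)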
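The argmax-then-decode step above is greedy CTC decoding: consecutive repeated predictions are merged, blank tokens are dropped, and a word-delimiter token is turned into a space. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation. The toy vocabulary, the blank ID of 0, and the `|` delimiter are assumptions chosen for illustration.

```python
from itertools import groupby


def ctc_greedy_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse a per-frame argmax sequence into text, CTC-style."""
    # Step 1: merge consecutive repeats. Step 2: drop blank tokens.
    collapsed = [i for i, _ in groupby(frame_ids) if i != blank_id]
    text = "".join(id_to_token[i] for i in collapsed)
    # The word-delimiter token stands in for a space between words.
    return " ".join(text.replace(word_delimiter, " ").split())


# Hypothetical toy vocabulary; ID 0 plays the role of the CTC blank.
vocab = {0: "<pad>", 1: "|", 2: "ব", 3: "া", 4: "ং", 5: "ল"}
frames = [2, 2, 0, 3, 3, 4, 0, 5, 5, 3, 1, 1, 0]
print(ctc_greedy_decode(frames, vocab))  # prints: বাংলা
```

A blank between two identical IDs keeps them as separate characters, which is how CTC represents genuinely doubled letters.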
Training Details
Training Data
The model was trained on the Bangla portions of Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR.
Training Procedure
Preprocessing
- Audio was resampled in the sequence 16 kHz, 8 kHz, 16 kHz.
- Transcripts were normalized to improve ASR performance.
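The card does not list the exact normalization rules, so the sketch below shows only one plausible transcript-normalization pass. The rule set is an assumption: apply Unicode NFC, keep only characters in the Bengali Unicode block plus whitespace, and collapse runs of spaces. The function name `normalize_transcript` is hypothetical.

```python
import re
import unicodedata

# Assumed rule: anything outside the Bengali block (U+0980 to U+09FF) or
# whitespace is replaced by a space. This also removes the danda and Latin text.
_NOT_ALLOWED = re.compile(r"[^\u0980-\u09FF\s]")


def normalize_transcript(text: str) -> str:
    """Return a canonical form of a Bangla transcript for CTC training."""
    # NFC gives visually identical strings the same code point sequence.
    text = unicodedata.normalize("NFC", text)
    text = _NOT_ALLOWED.sub(" ", text)
    # Collapse repeated whitespace and trim both ends.
    return " ".join(text.split())


print(normalize_transcript("আমি  ভাত খাই।  "))  # prints: আমি ভাত খাই
```

Note that this rule set also strips zero-width joiners, which some Bangla text relies on; a real pipeline may need to keep them.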
Training Hyperparameters
- Batch Size: 16
- Learning Rate: 1e-5
- Training Steps: 25000
- Mixed Precision: FP16
Training Time and Compute
- Hardware Used: RTX 4090
- Training Time: 37 Hours
- Dataset Size: 143k
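The figures above allow a rough estimate of the training budget. The arithmetic below is a back-of-the-envelope sketch under two assumptions that the card does not state: that "143k" counts utterances, and that no gradient accumulation or multi-GPU scaling was used, so each step processes exactly one batch of 16.

```python
batch_size = 16
training_steps = 25_000
training_hours = 37
dataset_size = 143_000  # assumption: "143k" is the number of utterances

# Total utterances processed over the whole run.
samples_seen = batch_size * training_steps

# Approximate number of passes over the dataset.
approx_epochs = samples_seen / dataset_size

# Average optimizer steps per hour on the single GPU.
steps_per_hour = training_steps / training_hours

print(samples_seen)             # 400000
print(round(approx_epochs, 2))  # 2.8
print(round(steps_per_hour))    # 676
```

Under these assumptions the run corresponds to a little under three epochs.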
Evaluation
Testing Data & Metrics
Metrics
- WER: 11.26%
- CER: 2.39%
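For readers unfamiliar with these metrics, WER and CER are both edit distances divided by the reference length, computed over words and characters respectively. The following is a minimal pure-Python sketch of the definitions, not the evaluation script used for this model. The function names and the example sentences are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (single-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over Unicode code points, spaces included."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "আমি ভাত খাই"
hypothesis = "আমি জল খাই"
print(round(wer(reference, hypothesis), 3))  # 0.333
print(round(cer(reference, hypothesis), 3))  # 0.273
```

Corpus-level scores are usually obtained by summing edits and reference lengths over all utterances rather than averaging per-utterance rates. This sketch also counts code points, so a Bangla vowel sign counts as its own character.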
Factors
The model was evaluated on:
- Standard Bangla speech
- Various speaker demographics
Results
- Performs well on clear, standard Bangla speech.
- Struggles with strong regional accents and noisy environments.
Technical Specifications
Model Architecture
The model is based on facebook/w2v-bert-2.0, a hybrid Wav2Vec2-BERT model for ASR.
Contact
For any issues or inquiries, please contact [email protected].
Model tree for sazzadul/Shrutimala_Bangla_ASR
Base model
facebook/w2v-bert-2.0