Model Card for Shrutimala Bangla ASR

Model Details

Model Description

This model is a fine-tuned version of facebook/w2v-bert-2.0 for automatic speech recognition (ASR) in Bangla. It was trained on a large Bangla corpus, primarily sourced from Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR, and achieves a Word Error Rate (WER) of approximately 11%.

  • Developed by: Sazzadul Islam
  • Model type: Wav2Vec-BERT-based Bangla ASR model
  • Language(s): Bangla (bn)
  • License: CC-BY-SA-4.0
  • Fine-tuned from: facebook/w2v-bert-2.0

Uses

Direct Use

This model can be used for automatic speech recognition (ASR) in Bangla and English, with applications in transcription, voice assistants, and accessibility tools.

Downstream Use

It can be further fine-tuned for domain-specific ASR tasks, including medical or legal transcription in Bangla.

Out-of-Scope Use

  • Not suitable for real-time ASR on low-power devices without optimization.
  • May not perform well in noisy environments or on highly accented regional dialects outside the training data.

Bias, Risks, and Limitations

  • The model may struggle with low-resource dialects and uncommon speech patterns.
  • Biases may exist due to dataset imbalances in gender, age, or socio-economic backgrounds.
  • Ethical considerations should be taken into account when using the model for surveillance or other sensitive applications.

How to Get Started with the Model

Use the following code snippet to load the model and run transcription:

from transformers import AutoProcessor, Wav2Vec2BertForCTC
import torch

processor = AutoProcessor.from_pretrained("sazzadul/Shrutimala_Bangla_ASR")
model = Wav2Vec2BertForCTC.from_pretrained("sazzadul/Shrutimala_Bangla_ASR")

# Load and process an audio file (expects a 16 kHz mono waveform)
audio_input = ...  # Provide a 1-D array or tensor of raw audio samples
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")

# Perform ASR with greedy CTC decoding
with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription)
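
Alternatively, the Transformers pipeline API wraps the same pre- and post-processing in a single call. This is a minimal sketch, assuming the checkpoint ships its processor and tokenizer configuration; the audio path is a placeholder.

from transformers import pipeline

# High-level ASR pipeline; handles feature extraction and CTC decoding internally
asr = pipeline("automatic-speech-recognition", model="sazzadul/Shrutimala_Bangla_ASR")

# Accepts a path to an audio file or a raw 16 kHz waveform array
result = asr("path/to/bangla_audio.wav")  # placeholder path
print(result["text"])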

Training Details

Training Data

The model was trained on the Bangla portions of the Mozilla Common Voice 17.0, Common Voice 20.0, and OpenSLR datasets.

Training Procedure

Preprocessing

  • Audio was resampled through a 16 kHz → 8 kHz → 16 kHz chain.
  • Transcripts were normalized to improve ASR performance (a sketch of both steps follows).
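
The exact preprocessing code is not published with this card. The following is a minimal sketch of the two steps, assuming torchaudio for resampling and a simple punctuation/whitespace cleanup for transcripts; the actual normalization rules used during training may differ.

import re
import torch
import torchaudio

def resample_16_8_16(waveform: torch.Tensor, orig_sr: int) -> torch.Tensor:
    # Resample to 16 kHz, down to 8 kHz, then back up to 16 kHz
    wav = torchaudio.functional.resample(waveform, orig_sr, 16000)
    wav = torchaudio.functional.resample(wav, 16000, 8000)
    return torchaudio.functional.resample(wav, 8000, 16000)

def normalize_transcript(text: str) -> str:
    # Hypothetical normalization: strip common punctuation (including the
    # Bangla danda "।") and collapse whitespace; the real rules are not documented.
    punct = "\"'।,?!;:()-"
    text = re.sub(f"[{re.escape(punct)}]", " ", text)
    return re.sub(r"\s+", " ", text).strip()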

Training Hyperparameters

  • Batch Size: 16
  • Learning Rate: 1e-5
  • Training Steps: 25000
  • Mixed Precision: FP16 (see the configuration sketch below)
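
The training script itself is not included in the card. This is a minimal sketch of how the reported hyperparameters would map onto Hugging Face TrainingArguments, assuming the standard Trainer was used; output_dir is an illustrative assumption.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="shrutimala-bangla-asr",  # hypothetical output path
    per_device_train_batch_size=16,      # Batch Size: 16
    learning_rate=1e-5,                  # Learning Rate: 1e-5
    max_steps=25_000,                    # Training Steps: 25000
    fp16=True,                           # Mixed Precision: FP16
)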

Training Time and Compute

  • Hardware Used: RTX 4090
  • Training Time: 37 Hours
  • Dataset Size: 143k

Evaluation

Testing Data & Metrics

Metrics

  • WER: 11.26%
  • CER: 2.39% (see the evaluation sketch below)
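
The exact evaluation script is not provided here. The snippet below is a minimal sketch of how WER and CER can be computed with the evaluate library; the transcripts are placeholders standing in for the held-out test split.

import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder transcripts; in practice these come from the test set
predictions = ["..."]  # model outputs
references = ["..."]   # ground-truth transcripts

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))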

Factors

The model was evaluated on:

  • Standard Bangla speech
  • Various speaker demographics

Results

  • Performs well on clear, standard Bangla speech.
  • Struggles with strong regional accents and noisy environments.

Technical Specifications

Model Architecture

The model is based on facebook/w2v-bert-2.0, a Wav2Vec2-BERT speech encoder, fine-tuned with a CTC head for Bangla ASR. The fine-tuned checkpoint has roughly 606M parameters, stored as F32 safetensors.

Contact

For any issues or inquiries, please contact [email protected].
