# WhisperV3 Nepali v0.5
A Nepali automatic speech recognition (ASR) model fine‑tuned from Whisper Large V3 with LoRA. Trained on Nepali speech and transcriptions to improve accuracy on Nepali audio compared to the base model.
## Model details
- Base model: Whisper Large V3 (loaded via Unsloth FastModel)
- Adapter method: LoRA on attention projections
- Target modules: q_proj, v_proj
- Rank (r): 64
- Alpha: 64
- Dropout: 0
- Gradient checkpointing: "unsloth"
- Task: Transcribe
- Language configuration: Nepali (generation_config.language set to <|ne|>; suppress_tokens cleared; no forced decoder ids)
- Precision: fp16 on GPUs without bf16; bf16 where supported
- Seed: 3407
This model was trained and saved as LoRA adapters, with optional merged 16‑bit/4‑bit export paths available via Unsloth utilities.
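For reference, below is a hedged sketch of how this adapter setup might be reconstructed with Unsloth. Exact keyword arguments can vary across Unsloth versions, so treat this as an outline of the configuration listed above rather than the notebook's verbatim code:

```python
from unsloth import FastModel

# Load the base model; Unsloth selects fp16 or bf16 based on GPU support.
model, tokenizer = FastModel.from_pretrained(
    model_name="openai/whisper-large-v3",
    load_in_4bit=False,
)

# Attach LoRA adapters with the hyperparameters from this card.
model = FastModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=64,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Nepali transcription setup described above.
model.generation_config.language = "<|ne|>"
model.generation_config.task = "transcribe"
model.generation_config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None
```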
## Intended uses and limitations
- Intended use: Transcribing Nepali speech (general domain, conversational and read speech).
- Out‑of‑scope: Non‑Nepali languages, heavy code‑switching, extreme noise, domain‑specific jargon not present in training data.
- Known limitations: Accuracy may degrade on noisy audio, long‑form audio without segmentation, or accents/styles unseen during training.
## Training data
- Primary dataset: Common Voice 17.0 Nepali (language code "ne‑NP")
- Splits: train + validation used for training; test used for evaluation
- Audio: resampled to 16 kHz for Whisper
Data was prepared with a processing function that extracts Whisper input features from the audio and tokenizes the target transcripts, using Common Voice's “sentence” column as the text field.
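A minimal sketch of that preprocessing step, following the standard Hugging Face Whisper fine-tuning pattern (function and variable names here are illustrative, not the notebook's):

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

def prepare_example(batch):
    audio = batch["audio"]
    # Log-mel input features from the 16 kHz waveform.
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token ids for the reference transcript ("sentence" is Common Voice's text column).
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

ds = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
ds = ds.map(prepare_example, remove_columns=ds.column_names)
```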
## Training configuration
- Loader and framework: Hugging Face Datasets + Transformers with Unsloth acceleration
- Batching: per_device_train_batch_size = 2, gradient_accumulation_steps = 4
- Optimization: AdamW 8‑bit, learning_rate = 1e‑4, weight_decay = 0.01, cosine LR schedule
- Training length: num_train_epochs = 3, with max_steps = 200 capping the run (max_steps overrides epochs) for a quick pass
- Evaluation: eval_strategy = "steps", eval_steps = 5, label_names = ["labels"]
- Logging: logging_steps = 1
- Other: remove_unused_columns = False (for PEFT forward signatures)
Training used a Google Colab T4 environment (around 14.7 GB GPU memory), with peak reserved memory during training around 6.2 GB in the referenced session.
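For orientation, here is how those settings map onto transformers.Seq2SeqTrainingArguments; the output_dir is illustrative, and the model/dataset wiring from the earlier steps is assumed:

```python
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-nepali-lora",  # illustrative output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    max_steps=200,                     # caps the run; overrides num_train_epochs
    eval_strategy="steps",
    eval_steps=5,
    logging_steps=1,
    label_names=["labels"],
    remove_unused_columns=False,       # required for PEFT forward signatures
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    seed=3407,
)
```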
## How to use

### Quick inference
```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",  # replace with your model id if different
    return_language=True,
    torch_dtype=torch.float16,
)

result = asr("path/to/audio.wav")  # 16 kHz mono recommended
print(result["text"])
```
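For long-form audio (a noted limitation above), the ASR pipeline's built-in chunking can help. Reusing the `asr` pipeline from the previous snippet, with illustrative values:

```python
# Chunked decoding for long recordings; chunk/batch sizes are illustrative.
result = asr("path/to/long_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])
```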
### Processor-level usage
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import soundfile as sf

model_id = "chhatramani/WhisperV3_Nepali_v0.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Whisper's feature extractor expects 16 kHz mono; resample first if your file differs.
audio, sr = sf.read("path/to/audio.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda", torch.float16)

# Language and task are preset in the model's generation_config (<|ne|>, transcribe).
pred_ids = model.generate(**inputs)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)
```
## Evaluation
Below is a minimal recipe to compute WER/CER on a Nepali test set (e.g., Common Voice 17.0 “test”). Adjust paths and batching for your setup.
```python
from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    return_language=True,
)

# Common Voice is gated on the Hub; accept its terms and log in first.
test = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in test:
    ref = ex.get("sentence", "").strip()
    if not ref:
        continue  # skip examples without a reference transcript
    out = asr(ex["audio"]["array"])
    refs.append(ref)
    hyps.append(out["text"].strip())

print("WER:", wer.compute(references=refs, predictions=hyps))
print("CER:", cer.compute(references=refs, predictions=hyps))
```
- The inference and evaluation patterns above mirror the training notebook, including 16 kHz resampling and use of the “sentence” column as the text field.
If you have your own Nepali test set, ensure it’s sampled at 16 kHz and transcriptions are normalized consistently with training data.
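As one illustrative approach (not the card's prescribed method), a simple normalizer for Nepali text might strip punctuation, including the Devanagari danda, and collapse whitespace before scoring:

```python
import re

def normalize_ne(text: str) -> str:
    # Remove danda (।), double danda (॥), and common punctuation; collapse spaces.
    text = re.sub(r"[\u0964\u0965,.?!;:\"'()\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```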
## Reproducibility
- Environment: Transformers + Datasets + Unsloth; GPU T4 session illustrated in the notebook
- Determinism: Seed fixed at 3407 for trainer and LoRA setup
- Saving: LoRA adapters saved via `save_pretrained`/`push_to_hub`; optional merged 16‑bit or 4‑bit exports are supported by Unsloth APIs
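A hedged sketch of those saving paths, assuming the `model` and `processor` objects from earlier (Unsloth's merged-export helper and its kwargs may differ by version):

```python
# Save/upload the LoRA adapters only.
model.save_pretrained("whisper-nepali-lora")
processor.save_pretrained("whisper-nepali-lora")
model.push_to_hub("chhatramani/WhisperV3_Nepali_v0.5")

# Optional merged export via Unsloth (verify the method for your version):
# model.save_pretrained_merged("whisper-nepali-merged", tokenizer, save_method="merged_16bit")
```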
## Acknowledgements
- Base model: Whisper Large V3
- Training utilities: Unsloth FastModel and PEFT LoRA support
- Dataset: mozilla-foundation/common_voice_17_0 (Nepali)
The included training notebook steps (installation, data prep, training loop, saving, and example inference) informed this model card’s details.