# WhisperV3 Nepali v0.5
A Nepali automatic speech recognition (ASR) model fine‑tuned from Whisper Large V3 with LoRA. Trained on Nepali speech and transcriptions to improve accuracy on Nepali audio compared to the base model.
## Model details
- Base model: Whisper Large V3 (loaded via Unsloth FastModel)
- Adapter method: LoRA on attention projections
- Target modules: q_proj, v_proj
- Rank (r): 64
- Alpha: 64
- Dropout: 0
- Gradient checkpointing: "unsloth"
- Task: Transcribe
- Language configuration: Nepali (generation_config.language set to <|ne|>; suppress_tokens cleared; no forced decoder ids)
- Precision: fp16 on GPUs without bf16; bf16 where supported
- Seed: 3407
This model was trained and saved as LoRA adapters, with optional merged 16‑bit/4‑bit export paths available via Unsloth utilities.
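For reference, below is a hedged sketch of how this adapter setup might be reconstructed with Unsloth. Exact keyword arguments can vary across Unsloth versions, so treat this as an outline of the configuration listed above rather than the notebook's verbatim code:

```python
from unsloth import FastModel

# Load the base model; Unsloth selects fp16 or bf16 based on GPU support.
model, tokenizer = FastModel.from_pretrained(
    model_name="openai/whisper-large-v3",
    load_in_4bit=False,
)

# Attach LoRA adapters with the hyperparameters from this card.
model = FastModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=64,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Nepali transcription setup described above.
model.generation_config.language = "<|ne|>"
model.generation_config.task = "transcribe"
model.generation_config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None
```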
## Intended uses and limitations
- Intended use: Transcribing Nepali speech (general domain, conversational and read speech).
- Out‑of‑scope: Non‑Nepali languages, heavy code‑switching, extreme noise, domain‑specific jargon not present in training data.
- Known limitations: Accuracy may degrade on noisy audio, long‑form audio without segmentation, or accents/styles unseen during training.
## Training data
- Primary dataset: Common Voice 17.0 Nepali (language code "ne‑NP")
- Splits: train + validation used for training; test used for evaluation
- Audio: resampled to 16 kHz for Whisper
Data was prepared with a processing function that extracts Whisper input features from the audio and tokenizes the target transcripts, using Common Voice's “sentence” column as the text field.
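A minimal sketch of that preprocessing step, following the standard Hugging Face Whisper fine-tuning pattern (function and variable names here are illustrative, not the notebook's):

```python
from datasets import load_dataset, Audio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("openai/whisper-large-v3")

def prepare_example(batch):
    audio = batch["audio"]
    # Log-mel input features from the 16 kHz waveform.
    batch["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Token ids for the reference transcript ("sentence" is Common Voice's text column).
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

ds = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))
ds = ds.map(prepare_example, remove_columns=ds.column_names)
```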
## Training configuration
- Loader and framework: Hugging Face Datasets + Transformers with Unsloth acceleration
- Batching: per_device_train_batch_size = 2, gradient_accumulation_steps = 4
- Optimization: AdamW 8‑bit, learning_rate = 1e‑4, weight_decay = 0.01, cosine LR schedule
- Training length: num_train_epochs = 3, with max_steps = 200 capping the run (max_steps overrides epochs) for a quick pass
- Evaluation: eval_strategy = "steps", eval_steps = 5, label_names = ["labels"]
- Logging: logging_steps = 1
- Other: remove_unused_columns = False (for PEFT forward signatures)
Training used a Google Colab T4 environment (around 14.7 GB GPU memory), with peak reserved memory during training around 6.2 GB in the referenced session.
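For orientation, here is how those settings map onto transformers.Seq2SeqTrainingArguments; the output_dir is illustrative, and the model/dataset wiring from the earlier steps is assumed:

```python
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-nepali-lora",  # illustrative output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    max_steps=200,                     # caps the run; overrides num_train_epochs
    eval_strategy="steps",
    eval_steps=5,
    logging_steps=1,
    label_names=["labels"],
    remove_unused_columns=False,       # required for PEFT forward signatures
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    seed=3407,
)
```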
## How to use

### Quick inference
```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",  # replace with your model id if different
    return_language=True,
    torch_dtype=torch.float16,
)

result = asr("path/to/audio.wav")  # 16 kHz mono recommended
print(result["text"])
```
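For long-form audio (a noted limitation above), the ASR pipeline's built-in chunking can help. Reusing the `asr` pipeline from the previous snippet, with illustrative values:

```python
# Chunked decoding for long recordings; chunk/batch sizes are illustrative.
result = asr("path/to/long_audio.wav", chunk_length_s=30, batch_size=8)
print(result["text"])
```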
### Processor-level usage
```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import soundfile as sf

model_id = "chhatramani/WhisperV3_Nepali_v0.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Whisper's feature extractor expects 16 kHz mono; resample first if your file differs.
audio, sr = sf.read("path/to/audio.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda", torch.float16)

# Language and task are preset in the model's generation_config (<|ne|>, transcribe).
pred_ids = model.generate(**inputs)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)
```
## Evaluation
Below is a minimal recipe to compute WER/CER on a Nepali test set (e.g., Common Voice 17.0 “test”). Adjust paths and batching for your setup.
```python
from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    return_language=True,
)

# Common Voice is gated on the Hub; accept its terms and log in first.
test = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in test:
    ref = ex.get("sentence", "").strip()
    if not ref:
        continue  # skip examples without a reference transcript
    out = asr(ex["audio"]["array"])
    refs.append(ref)
    hyps.append(out["text"].strip())

print("WER:", wer.compute(references=refs, predictions=hyps))
print("CER:", cer.compute(references=refs, predictions=hyps))
```
- The inference and evaluation patterns above mirror the training notebook, including 16 kHz resampling and use of the “sentence” column as the text field.
If you have your own Nepali test set, ensure it’s sampled at 16 kHz and transcriptions are normalized consistently with training data.
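As one illustrative approach (not the card's prescribed method), a simple normalizer for Nepali text might strip punctuation, including the Devanagari danda, and collapse whitespace before scoring:

```python
import re

def normalize_ne(text: str) -> str:
    # Remove danda (।), double danda (॥), and common punctuation; collapse spaces.
    text = re.sub(r"[\u0964\u0965,.?!;:\"'()\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()
```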
## Reproducibility
- Environment: Transformers + Datasets + Unsloth; GPU T4 session illustrated in the notebook
- Determinism: Seed fixed at 3407 for trainer and LoRA setup
- Saving: LoRA adapters saved via `save_pretrained`/`push_to_hub`; optional merged 16‑bit or 4‑bit exports are supported by Unsloth APIs
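A hedged sketch of those saving paths, assuming the `model` and `processor` objects from earlier (Unsloth's merged-export helper and its kwargs may differ by version):

```python
# Save/upload the LoRA adapters only.
model.save_pretrained("whisper-nepali-lora")
processor.save_pretrained("whisper-nepali-lora")
model.push_to_hub("chhatramani/WhisperV3_Nepali_v0.5")

# Optional merged export via Unsloth (verify the method for your version):
# model.save_pretrained_merged("whisper-nepali-merged", tokenizer, save_method="merged_16bit")
```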
## Acknowledgements
- Base model: Whisper Large V3
- Training utilities: Unsloth FastModel and PEFT LoRA support
- Dataset: mozilla-foundation/common_voice_17_0 (Nepali)
The included training notebook steps (installation, data prep, training loop, saving, and example inference) informed this model card’s details.