WhisperV3 Nepali v0.5

A Nepali automatic speech recognition (ASR) model fine‑tuned from Whisper Large V3 with LoRA. Trained on Nepali speech and transcriptions to improve accuracy on Nepali audio compared to the base model.


Model details

  • Base model: Whisper Large V3 (loaded via Unsloth FastModel)
  • Adapter method: LoRA on attention projections
    • Target modules: q_proj, v_proj
    • Rank (r): 64
    • Alpha: 64
    • Dropout: 0
    • Gradient checkpointing: "unsloth"
  • Task: Transcribe
  • Language configuration: Nepali (generation_config.language set to <|ne|>; suppress_tokens cleared; no forced decoder ids)
  • Precision: fp16 on GPUs without bf16; bf16 where supported
  • Seed: 3407

This model was trained and saved as LoRA adapters, with optional merged 16‑bit/4‑bit export paths available via Unsloth utilities.
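
For reference, here is a minimal sketch of how the configuration above maps onto Unsloth's API. Parameter names follow Unsloth's FastModel / get_peft_model helpers and the settings listed above; the notebook's exact code may differ.

from unsloth import FastModel

# Load the base model (Unsloth picks bf16 where supported, else fp16)
model, tokenizer = FastModel.from_pretrained(
    model_name="openai/whisper-large-v3",
)

# Attach LoRA adapters to the attention projections
model = FastModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=64,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Nepali transcription setup, per the language configuration above
model.generation_config.language = "<|ne|>"
model.generation_config.task = "transcribe"
model.generation_config.forced_decoder_ids = None
model.config.suppress_tokens = []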


Intended uses and limitations

  • Intended use: Transcribing Nepali speech (general domain, conversational and read speech).
  • Out‑of‑scope: Non‑Nepali languages, heavy code‑switching, extreme noise, domain‑specific jargon not present in training data.
  • Known limitations: Accuracy may degrade on noisy audio, long‑form audio without segmentation, or accents/styles unseen during training.

Training data

  • Primary dataset: Common Voice 17.0 Nepali (language code "ne‑NP")
    • Splits: train + validation used for training; test used for evaluation
    • Audio: resampled to 16 kHz for Whisper

Data was prepared with a processing function that extracts Whisper input features from the audio and tokenizes the target transcripts, using Common Voice's “sentence” column as the text field.
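
A minimal sketch of that preparation step, assuming the standard Whisper processor pattern (processor here is the model's WhisperProcessor, loaded beforehand):

from datasets import load_dataset, Audio

cv = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP")
cv = cv.cast_column("audio", Audio(sampling_rate=16000))  # Whisper expects 16 kHz

def prepare_dataset(batch):
    audio = batch["audio"]
    # Log-mel input features from the raw waveform
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized Nepali transcript as labels ("sentence" is Common Voice's text column)
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = cv.map(prepare_dataset, remove_columns=cv["train"].column_names)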


Training configuration

  • Loader and framework: Hugging Face Datasets + Transformers with Unsloth acceleration
  • Batching: per_device_train_batch_size = 2, gradient_accumulation_steps = 4
  • Optimization: AdamW 8‑bit, learning_rate = 1e‑4, weight_decay = 0.01, cosine LR schedule
  • Training length: num_train_epochs = 3, capped at max_steps = 200 for a quick run (max_steps takes precedence)
  • Evaluation: eval_strategy = "steps", eval_steps = 5, label_names = ["labels"]
  • Logging: logging_steps = 1
  • Other: remove_unused_columns = False (for PEFT forward signatures)

Training ran in a Google Colab T4 session (about 14.7 GB of GPU memory available); peak reserved GPU memory during training was roughly 6.2 GB in the referenced run.
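
Put together, the settings above correspond roughly to the following trainer setup (a sketch: model comes from the LoRA setup, train_dataset/test_dataset from the data preparation, and the padding data collator is elided):

import torch
from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size of 8
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",
    num_train_epochs=3,
    max_steps=200,                   # caps the quick run
    eval_strategy="steps",
    eval_steps=5,
    logging_steps=1,
    label_names=["labels"],
    remove_unused_columns=False,     # keep PEFT forward signatures intact
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    seed=3407,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,   # Common Voice train + validation, prepared as above
    eval_dataset=test_dataset,     # Common Voice test
    data_collator=data_collator,   # pads input_features and labels (not shown)
)
trainer.train()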


How to use

Quick inference

from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",   # replace with your model id if different
    return_language=True,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,  # fp16 needs a GPU
    device=0 if torch.cuda.is_available() else "cpu",
)

result = asr("path/to/audio.wav")  # 16 kHz mono recommended; file inputs are resampled automatically
print(result["text"])
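
For long-form audio (a noted limitation above), the pipeline can transcribe in fixed windows via its built-in chunking; a minimal variant of the call above:

asr_long = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    chunk_length_s=30,  # split long recordings into 30 s windows
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
)
print(asr_long("path/to/long_audio.wav")["text"])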

Processor-level usage

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import soundfile as sf

model_id = "chhatramani/WhisperV3_Nepali_v0.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

audio, sr = sf.read("path/to/audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono
# Whisper's feature extractor expects 16 kHz input; resample first if sr != 16000
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda", torch.float16)
pred_ids = model.generate(**inputs)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)

Evaluation

Below is a minimal recipe to compute WER/CER on a Nepali test set (e.g., Common Voice 17.0 “test”). Adjust paths and batching for your setup.

from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    return_language=True
)

test = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in test:
    ref = ex.get("sentence", "").strip()
    if not ref:  # skip rows without a transcript
        continue
    # Pass the sampling rate explicitly so the pipeline never has to guess
    out = asr({"array": ex["audio"]["array"], "sampling_rate": ex["audio"]["sampling_rate"]})
    refs.append(ref)
    hyps.append(out["text"].strip())

print("WER:", wer.compute(references=refs, predictions=hyps))
print("CER:", cer.compute(references=refs, predictions=hyps))

Notes:

  • The inference and evaluation patterns above mirror the training notebook, including resampling to 16 kHz and using “sentence” as the text field.
  • If you use your own Nepali test set, make sure it is sampled at 16 kHz and that transcriptions are normalized consistently with the training data.

Reproducibility

  • Environment: Transformers + Datasets + Unsloth; GPU T4 session illustrated in the notebook
  • Determinism: Seed fixed at 3407 for trainer and LoRA setup
  • Saving: LoRA adapters saved via save_pretrained / push_to_hub; optional merged exports to 16‑bit or 4‑bit are supported in Unsloth APIs
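
A sketch of those save paths (save_pretrained_merged and its save_method values follow Unsloth's documented export helpers; confirm they apply to Whisper models in your Unsloth version):

# LoRA adapters only (small; the base model is fetched at load time)
model.save_pretrained("whisper_nepali_lora")
model.push_to_hub("chhatramani/WhisperV3_Nepali_v0.5")

# Optional merged exports via Unsloth utilities
model.save_pretrained_merged("whisper_nepali_merged", tokenizer, save_method="merged_16bit")
model.save_pretrained_merged("whisper_nepali_4bit", tokenizer, save_method="merged_4bit")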

Acknowledgements

  • Base model: Whisper Large V3
  • Training utilities: Unsloth FastModel and PEFT LoRA support
  • Dataset: mozilla-foundation/common_voice_17_0 (Nepali)

The included training notebook steps (installation, data prep, training loop, saving, and example inference) informed this model card’s details.
