Update README.md

README.md (changed)
---
license: apache-2.0
tags:
- unsloth
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# WhisperV3 Nepali v0.5

A Nepali automatic speech recognition (ASR) model fine‑tuned from Whisper Large V3 with LoRA. Trained on Nepali speech and transcriptions to improve accuracy on Nepali audio compared to the base model.

---

## Model details

- **Base model:** Whisper Large V3 (loaded via Unsloth FastModel)
- **Adapter method:** LoRA on attention projections
- **Target modules:** q_proj, v_proj
- **Rank (r):** 64
- **Alpha:** 64
- **Dropout:** 0
- **Gradient checkpointing:** "unsloth"
- **Task:** Transcribe
- **Language configuration:** Nepali (generation_config.language set to <|ne|>; suppress_tokens cleared; no forced decoder ids)
- **Precision:** fp16 on GPUs without bf16; bf16 where supported
- **Seed:** 3407

> This model was trained and saved as LoRA adapters, with optional merged 16‑bit/4‑bit export paths available via Unsloth utilities.
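
The adapter hyperparameters and language settings listed above correspond roughly to the configuration sketched below. This is an illustrative reconstruction using the PEFT API directly; the actual training went through Unsloth's FastModel wrapper, so argument names and defaults may differ slightly.

```python
# Illustrative sketch only: an equivalent PEFT LoRA setup matching the listed
# hyperparameters. The original notebook used Unsloth's FastModel wrapper.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

lora_config = LoraConfig(
    r=64,                                 # rank
    lora_alpha=64,                        # alpha
    lora_dropout=0.0,                     # dropout
    target_modules=["q_proj", "v_proj"],  # attention projections
    bias="none",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Nepali decoding configuration, as described above.
model.generation_config.language = "<|ne|>"
model.generation_config.task = "transcribe"
model.generation_config.suppress_tokens = []
model.generation_config.forced_decoder_ids = None
```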

---

## Intended uses and limitations

- **Intended use:** Transcribing Nepali speech (general domain, conversational and read speech).
- **Out‑of‑scope:** Non‑Nepali languages, heavy code‑switching, extreme noise, domain‑specific jargon not present in training data.
- **Known limitations:** Accuracy may degrade on noisy audio, long‑form audio without segmentation, or accents/styles unseen during training.

---

## Training data

- **Primary dataset:** Common Voice 17.0 Nepali (language code "ne-NP")
- **Splits:** train + validation used for training; test used for evaluation
- **Audio:** resampled to 16 kHz for Whisper

Data was prepared with a processing function that extracts Whisper input features from the audio and tokenizes the target transcripts, using the Common Voice "sentence" column as the text field.
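
The sketch below shows what such a preparation step typically looks like for Whisper fine‑tuning; the function and variable names are illustrative, not the exact notebook code.

```python
# Illustrative data-preparation sketch, assuming a WhisperProcessor and the
# Common Voice "sentence" column; names here are for demonstration only.
from datasets import load_dataset, Audio
from transformers import WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")

train = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="train")
train = train.cast_column("audio", Audio(sampling_rate=16000))  # resample to 16 kHz

def prepare_example(example):
    audio = example["audio"]
    # Log-Mel input features for the Whisper encoder
    example["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Target token ids from the transcript
    example["labels"] = processor.tokenizer(example["sentence"]).input_ids
    return example

train = train.map(prepare_example, remove_columns=train.column_names)
```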

---

## Training configuration

- **Loader and framework:** Hugging Face Datasets + Transformers with Unsloth acceleration
- **Batching:** per_device_train_batch_size = 2, gradient_accumulation_steps = 4
- **Optimization:** AdamW 8‑bit, learning_rate = 1e‑4, weight_decay = 0.01, cosine LR schedule
- **Training length:** num_train_epochs = 3 with max_steps = 200 for a quick run
- **Evaluation:** eval_strategy = "steps", eval_steps = 5, label_names = ["labels"]
- **Logging:** logging_steps = 1
- **Other:** remove_unused_columns = False (for PEFT forward signatures)

Training ran in a Google Colab T4 environment (about 14.7 GB of GPU memory), with peak reserved memory of roughly 6.2 GB during the referenced session.
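
Put together, those settings correspond to training arguments roughly like the sketch below. This is a hedged reconstruction: output_dir is a placeholder, and the Seq2SeqTrainer, data collator, and dataset wiring from the notebook are not reproduced here.

```python
# Hedged reconstruction of the training arguments listed above; output_dir is
# illustrative and the surrounding Seq2SeqTrainer setup is omitted.
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-nepali-lora",       # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    optim="adamw_8bit",                     # AdamW 8-bit (bitsandbytes)
    num_train_epochs=3,
    max_steps=200,                          # caps the quick run
    eval_strategy="steps",                  # "evaluation_strategy" on older transformers
    eval_steps=5,
    logging_steps=1,
    label_names=["labels"],
    remove_unused_columns=False,            # keep columns for PEFT forward signatures
    seed=3407,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
)
# Pass these to Seq2SeqTrainer together with the model, datasets, processor,
# and a collator that pads input_features and labels.
```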

---

## How to use

### Quick inference

```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",  # replace with your model id if different
    return_language=True,
    torch_dtype=torch.float16,
    # pass device=0 (or device_map="auto") to run on a GPU
)

result = asr("path/to/audio.wav")  # 16 kHz mono recommended
print(result["text"])
```

### Processor-level usage

```python
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
import torch
import soundfile as sf

model_id = "chhatramani/WhisperV3_Nepali_v0.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Whisper expects 16 kHz mono audio; resample beforehand if your file differs.
audio, sr = sf.read("path/to/audio.wav")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono

inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda", torch.float16)
pred_ids = model.generate(**inputs)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)
```

### Evaluation

Below is a minimal recipe to compute WER/CER on a Nepali test set (e.g., Common Voice 17.0 "test"). Adjust paths and batching for your setup.

```python
from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    return_language=True,
)

test = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in test:
    ref = ex.get("sentence", "").strip()
    if not ref:
        continue
    out = asr(ex["audio"]["array"])
    hyp = out["text"].strip()
    refs.append(ref)
    hyps.append(hyp)

print("WER:", wer.compute(references=refs, predictions=hyps))
print("CER:", cer.compute(references=refs, predictions=hyps))
```

- Inference and eval pipeline patterns mirror the training notebook, including resampling to 16 kHz and mapping "sentence" as the text field.

> If you have your own Nepali test set, ensure it's sampled at 16 kHz and transcriptions are normalized consistently with training data.
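
For example, one simple way to load an arbitrary file as 16 kHz mono before passing it to the pipeline (assuming librosa is installed; any resampler works):

```python
# Load and resample an arbitrary audio file to 16 kHz mono (assumes librosa).
import librosa

audio, sr = librosa.load("path/to/audio.wav", sr=16000, mono=True)
result = asr(audio)  # reuses the pipeline defined in the evaluation snippet
print(result["text"])
```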

---

## Reproducibility

- **Environment:** Transformers + Datasets + Unsloth; GPU T4 session illustrated in the notebook
- **Determinism:** Seed fixed at 3407 for trainer and LoRA setup
- **Saving:** LoRA adapters saved via `save_pretrained` / `push_to_hub`; optional merged exports to 16‑bit or 4‑bit are supported in Unsloth APIs
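
A minimal sketch of the saving step is shown below; the output paths and repo id are placeholders, and the merged-export call follows the pattern documented by Unsloth (it may vary by version and model type).

```python
# Save the LoRA adapters locally and push them to the Hub (placeholder ids).
model.save_pretrained("whisper-nepali-lora")
processor.save_pretrained("whisper-nepali-lora")
model.push_to_hub("your-username/whisper-nepali-lora")

# Optional merged export via Unsloth utilities (per Unsloth docs; API may vary):
# model.save_pretrained_merged("whisper-nepali-merged", processor, save_method="merged_16bit")
```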

---

## Acknowledgements

- **Base model:** Whisper Large V3
- **Training utilities:** Unsloth FastModel and PEFT LoRA support
- **Dataset:** mozilla-foundation/common_voice_17_0 (Nepali)

> The included training notebook steps (installation, data prep, training loop, saving, and example inference) informed this model card's details.
|