chhatramani committed · Commit d9ea573 · verified · 1 parent: 33cce97

Update README.md

---
license: apache-2.0
tags:
- unsloth
datasets:
- mozilla-foundation/common_voice_17_0
base_model:
- openai/whisper-large-v3
pipeline_tag: automatic-speech-recognition
---

# WhisperV3 Nepali v0.5

A Nepali automatic speech recognition (ASR) model fine-tuned from Whisper Large V3 with LoRA. It was trained on Nepali speech and transcriptions to improve accuracy on Nepali audio over the base model.

---

## Model details

- **Base model:** Whisper Large V3 (loaded via Unsloth FastModel)
- **Adapter method:** LoRA on attention projections
- **Target modules:** q_proj, v_proj
- **Rank (r):** 64
- **Alpha:** 64
- **Dropout:** 0
- **Gradient checkpointing:** "unsloth"
- **Task:** transcribe
- **Language configuration:** Nepali (generation_config.language set to <|ne|>; suppress_tokens cleared; no forced decoder ids)
- **Precision:** fp16 on GPUs without bf16 support; bf16 where supported
- **Seed:** 3407

> This model was trained and saved as LoRA adapters, with optional merged 16-bit/4-bit export paths available via Unsloth utilities.
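
For reference, a minimal sketch of this adapter setup, assuming the Unsloth FastModel API mentioned above; exact loader arguments and the returned tokenizer/processor object may differ across Unsloth versions:

```python
from unsloth import FastModel

# Load the base model; Unsloth selects fp16 or bf16 based on GPU support.
model, processor = FastModel.from_pretrained("openai/whisper-large-v3")

# Attach LoRA adapters on the attention projections (r = 64, alpha = 64).
model = FastModel.get_peft_model(
    model,
    r=64,
    target_modules=["q_proj", "v_proj"],
    lora_alpha=64,
    lora_dropout=0,
    use_gradient_checkpointing="unsloth",
    random_state=3407,
)

# Nepali transcription setup, per the language configuration above.
model.generation_config.language = "<|ne|>"
model.generation_config.task = "transcribe"
model.config.suppress_tokens = []
model.config.forced_decoder_ids = None
```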

---

## Intended uses and limitations

- **Intended use:** Transcribing Nepali speech (general-domain conversational and read speech).
- **Out-of-scope:** Non-Nepali languages, heavy code-switching, extreme noise, and domain-specific jargon not present in the training data.
- **Known limitations:** Accuracy may degrade on noisy audio, on long-form audio without segmentation, or on accents and speaking styles unseen during training.

---

## Training data

- **Primary dataset:** Common Voice 17.0 Nepali (language code "ne-NP")
- **Splits:** train + validation used for training; test used for evaluation
- **Audio:** resampled to 16 kHz for Whisper

Data was prepared with a processing function that extracts Whisper input features from the audio and tokenizes the target transcripts, using "sentence" as the text field for Common Voice; a sketch follows.
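
A minimal sketch of that preparation step, following the standard Whisper preprocessing pattern; the function name `prepare_dataset` is illustrative, and `processor` is assumed to be the WhisperProcessor loaded with the model:

```python
from datasets import Audio, concatenate_datasets, load_dataset

def prepare_dataset(batch):
    # `processor` is assumed to be the WhisperProcessor for this model.
    audio = batch["audio"]
    # Log-Mel input features from the 16 kHz waveform.
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    # Tokenized target transcript; Common Voice stores text in "sentence".
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

cv = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP")
train = concatenate_datasets([cv["train"], cv["validation"]])
train = train.cast_column("audio", Audio(sampling_rate=16000))
train = train.map(prepare_dataset, remove_columns=train.column_names)
```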

---

## Training configuration

- **Loader and framework:** Hugging Face Datasets + Transformers with Unsloth acceleration
- **Batching:** per_device_train_batch_size = 2, gradient_accumulation_steps = 4 (effective batch size 8)
- **Optimization:** AdamW 8-bit, learning_rate = 1e-4, weight_decay = 0.01, cosine LR schedule
- **Training length:** num_train_epochs = 3, capped by max_steps = 200 for a quick run (max_steps takes precedence when both are set)
- **Evaluation:** eval_strategy = "steps", eval_steps = 5, label_names = ["labels"]
- **Logging:** logging_steps = 1
- **Other:** remove_unused_columns = False, so PEFT forward signatures still receive all needed columns (see the sketch after this list)
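
These settings map onto Transformers `Seq2SeqTrainingArguments` roughly as follows (output_dir is illustrative):

```python
import torch
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-nepali-lora",  # illustrative path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    optim="adamw_8bit",
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="cosine",
    num_train_epochs=3,
    max_steps=200,  # overrides num_train_epochs for the quick run
    eval_strategy="steps",
    eval_steps=5,
    logging_steps=1,
    label_names=["labels"],
    remove_unused_columns=False,  # keep the columns PEFT's forward signature needs
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
    seed=3407,
)
```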

Training ran in a Google Colab T4 session (about 14.7 GB of GPU memory available); peak reserved GPU memory during training was around 6.2 GB in the referenced session. A sketch of the corresponding trainer setup follows.
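
A hedged sketch of the trainer construction, using the standard Whisper padding-collator pattern from Hugging Face's fine-tuning examples together with the `model`, `processor`, `train`, and `training_args` objects from the sketches above (`test` here stands for a preprocessed eval split):

```python
from dataclasses import dataclass
from transformers import Seq2SeqTrainer

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: object

    def __call__(self, features):
        # Pad log-Mel features into a uniform batch tensor.
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(inputs, return_tensors="pt")
        # Pad labels and replace padding with -100 so the loss ignores it.
        labels = [{"input_ids": f["labels"]} for f in features]
        labels = self.processor.tokenizer.pad(labels, return_tensors="pt")
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=train,
    eval_dataset=test,
    data_collator=DataCollatorSpeechSeq2SeqWithPadding(processor),
)
trainer.train()
```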

---

## How to use

### Quick inference

```python
from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",  # replace with your model id if different
    return_language=True,
    torch_dtype=torch.float16,
)

result = asr("path/to/audio.wav")  # 16 kHz mono recommended
print(result["text"])
```

### Processor-level usage

```python
import torch
import soundfile as sf
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "chhatramani/WhisperV3_Nepali_v0.5"
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch.float16).eval().to("cuda")
processor = AutoProcessor.from_pretrained(model_id)

# Whisper's feature extractor expects 16 kHz mono audio; resample first if needed.
audio, sr = sf.read("path/to/audio.wav")
assert sr == 16000, "resample to 16 kHz before feature extraction"

inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda", torch.float16)
pred_ids = model.generate(**inputs)
text = processor.batch_decode(pred_ids, skip_special_tokens=True)[0]
print(text)
```

### Evaluation

Below is a minimal recipe to compute WER/CER on a Nepali test set (e.g., the Common Voice 17.0 "test" split). Adjust paths and batching for your setup.

```python
from datasets import load_dataset, Audio
from transformers import pipeline
import evaluate

wer = evaluate.load("wer")
cer = evaluate.load("cer")

asr = pipeline(
    "automatic-speech-recognition",
    model="chhatramani/WhisperV3_Nepali_v0.5",
    return_language=True,
)

test = load_dataset("mozilla-foundation/common_voice_17_0", "ne-NP", split="test")
test = test.cast_column("audio", Audio(sampling_rate=16000))

refs, hyps = [], []
for ex in test:
    ref = ex.get("sentence", "").strip()
    if not ref:
        continue
    out = asr(ex["audio"]["array"])
    refs.append(ref)
    hyps.append(out["text"].strip())

print("WER:", wer.compute(references=refs, predictions=hyps))
print("CER:", cer.compute(references=refs, predictions=hyps))
```

- Inference and evaluation pipeline patterns mirror the training notebook, including resampling to 16 kHz and using "sentence" as the text field.

> If you have your own Nepali test set, ensure it is sampled at 16 kHz and that transcriptions are normalized consistently with the training data.
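
For example, a tiny normalization helper (illustrative only; match it to whatever normalization your training transcripts used):

```python
import re
import unicodedata

def normalize_ne(text: str) -> str:
    # Unicode-normalize Devanagari, then collapse runs of whitespace.
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()
```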

## Reproducibility

- **Environment:** Transformers + Datasets + Unsloth; GPU T4 session illustrated in the notebook
- **Determinism:** Seed fixed at 3407 for the trainer and LoRA setup
- **Saving:** LoRA adapters saved via `save_pretrained` / `push_to_hub`; optional merged exports to 16-bit or 4-bit are supported by Unsloth APIs (see the sketch below)
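
A hedged sketch of the save/export step; the merged-export call is an assumption based on Unsloth's documented `save_pretrained_merged` convention and may differ by version:

```python
# Save the LoRA adapters (small files; the base model is fetched at load time).
model.save_pretrained("whisper-nepali-lora")  # illustrative path
processor.save_pretrained("whisper-nepali-lora")
model.push_to_hub("chhatramani/WhisperV3_Nepali_v0.5")

# Optional merged export via Unsloth (assumed API; check your Unsloth version):
# model.save_pretrained_merged("whisper-nepali-merged", processor,
#                              save_method="merged_16bit")  # or "merged_4bit"
```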

---

## Acknowledgements

- **Base model:** Whisper Large V3
- **Training utilities:** Unsloth FastModel and PEFT LoRA support
- **Dataset:** mozilla-foundation/common_voice_17_0 (Nepali)

> The training notebook steps (installation, data preparation, training loop, saving, and example inference) informed the details in this model card.