---
license: mit
language:
- en
base_model:
- microsoft/Phi-4-multimodal-instruct
pipeline_tag: automatic-speech-recognition
library_name: transformers
model-index:
- name: Phi-4-mm-inst-asr-singlish
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      name: SASRBench-v1
      type: mjwong/SASRBench-v1
      split: test
    metrics:
    - name: WER
      type: WER
      value: 13.16
  - task:
      type: automatic-speech-recognition
    dataset:
      name: AMI
      type: edinburghcstr/ami
      subset: ihm
      split: test
    metrics:
    - name: WER
      type: WER
      value: 20.23
  - task:
      type: automatic-speech-recognition
    dataset:
      name: GigaSpeech
      type: speechcolab/gigaspeech
      subset: test
      split: test
    metrics:
    - name: WER
      type: WER
      value: 10.34
tags:
- nlp
- code
- audio
- automatic-speech-recognition
- speech-summarization
- speech-translation
- visual-question-answering
- phi-4-multimodal
- phi
- phi-4-mini
---

# Phi-4-mm-inst-asr-singlish

**Phi-4-multimodal-instruct-asr-singlish** (Phi-4-mm-inst-asr-singlish) is a targeted effort to address a key limitation of broad large multimodal models (LMMs) such as Microsoft’s Phi-4: under-representation of regional dialects. Singlish’s code-switching and distinctive prosody frequently confound generic models.

However, Phi-4 has undergone vast pre-training that already captures complex linguistic structures, promising better generalisation than smaller ASR systems like Whisper. This targeted adaptation of [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) (Phi-4-mm-inst) marks progress toward the broader vision of a unified model that can listen, comprehend, and respond naturally—laying the groundwork for voice-first agents that reason, translate, and generate code seamlessly within a single contextual framework.

## Model Details

- **Developed by:** Ming Jie Wong
- **Base Model:** [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
- **Model Type:** Decoder-only Transformer with vision and speech adapters
- **Metrics:** Word Error Rate (WER)
- **Languages Supported:** English (with a focus on Singlish)
- **License:** MIT

### Description

This work employs supervised fine-tuning (SFT) of Phi-4-mm-inst for Singlish ASR by leveraging 66.9k paired audio–transcript examples. The dataset is derived exclusively from the Part 3 Same Room Environment Close-talk Mic recordings of [IMDA's NSC Corpus](https://www.imda.gov.sg/how-we-can-help/national-speech-corpus).

Rather than retraining all model parameters, we selectively unfreeze only the `audio_embed` module—specifically its encoder and audio projection layers—while keeping the remaining weights fixed. During training, each audio clip is paired with its ground-truth transcript, to which we append a dedicated end-of-transcription marker (`<|end|><|endoftext|>`). We then optimize a standard cross-entropy loss over the token sequences, teaching the model both to transcribe audio features into text and to emit the marker once the transcription is complete. This surgical, data-driven approach focuses computational resources on adapting the model’s audio processing to Singlish’s unique phonetic, prosodic, and code-switching characteristics, without altering its core language understanding.

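For illustration, below is a minimal sketch of the selective-unfreezing and label-construction steps described above. It assumes a standard Hugging Face workflow and that the relevant parameters live under a module whose name contains `audio_embed`; it is not the exact training script used for this release.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

base_path = "microsoft/Phi-4-multimodal-instruct"
processor = AutoProcessor.from_pretrained(base_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# Freeze every parameter, then unfreeze only the audio embedding stack
# (encoder + audio projection layers).
for name, param in model.named_parameters():
    param.requires_grad = "audio_embed" in name  # assumed module name

# Each training target is the ground-truth transcript followed by the
# end-of-transcription marker; training minimizes cross-entropy over
# these token sequences.
END_MARKER = "<|end|><|endoftext|>"

def build_label_ids(transcript: str) -> torch.Tensor:
    return processor.tokenizer(
        transcript + END_MARKER, return_tensors="pt"
    ).input_ids
```
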
The original Part 3 of the National Speech Corpus comprises approximately 1,000 hours of conversational speech from around 1,000 local English speakers, recorded in pairs. These conversations cover everyday topics and include interactive game-based dialogues. Recordings were conducted in two environments:
- Same Room, where speakers shared a room and were recorded using a close-talk mic and a boundary mic.
- Separate Room, where each speaker was recorded individually using a standing mic and a telephone (IVR).

Audio segments for the internal dataset were extracted using the following criteria (a filtering sketch follows this list):
- **Minimum Word Count:** 10 words

  _This threshold was chosen to ensure that each audio segment contains sufficient linguistic context for the model to better understand instructions in Singlish. Shorter segments may bias the model towards specific utterances or phrases, limiting its overall comprehension._
- **Maximum Duration:** 20 seconds

  _This threshold was chosen to provide enough context for accurate transcription while minimizing noise and computational complexity for longer audio segments._
- **Sampling Rate:** All audio segments are down-sampled to 16 kHz.

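As referenced above, here is a small sketch of how these thresholds could be applied when preparing segments. The exact preprocessing pipeline is not published, so the functions below are illustrative only; they assume segments arrive as NumPy waveforms with a known sampling rate and use `librosa` for resampling.

```python
import librosa
import numpy as np

MIN_WORDS = 10        # minimum word count per segment
MAX_SECONDS = 20.0    # maximum segment duration
TARGET_SR = 16_000    # target sampling rate (Hz)

def keep_segment(transcript: str, audio: np.ndarray, sr: int) -> bool:
    """Return True if the segment meets the word-count and duration thresholds."""
    return len(transcript.split()) >= MIN_WORDS and len(audio) / sr <= MAX_SECONDS

def to_16khz(audio: np.ndarray, sr: int) -> np.ndarray:
    """Down-sample the waveform to 16 kHz when needed."""
    if sr == TARGET_SR:
        return audio
    return librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
```
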
Full experiment details will be added soon.

### Fine-Tuning Details

We fine-tuned the model on a single A100-80GB GPU.

#### Training Hyperparameters

The following hyperparameters were used (see the `TrainingArguments` sketch after this list):
- **learning_rate**: 0.0001
- **train_batch_size**: 8
- **eval_batch_size**: 8
- **seed**: 42
- **Optimizer:**
  - **Name:** ADAMW_TORCH
  - **Betas:** (0.9, 0.99)
  - **Epsilon:** 1e-07
  - **Optimizer Arguments:** No additional optimizer arguments
- **lr_scheduler_type**: cosine
- **lr_scheduler_warmup_ratio**: 0.1
- **num_epochs**: 1

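As a rough guide, these settings map onto `transformers.TrainingArguments` as sketched below. Arguments not listed above (such as the output directory and precision) are assumptions, not published training details.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="phi4-mm-inst-asr-singlish",  # hypothetical output path
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.99,
    adam_epsilon=1e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    bf16=True,  # assumption, consistent with the bfloat16 usage shown below
)
```
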
## Benchmark Performance

We evaluated Phi-4-mm-inst-asr-singlish on the following datasets (a minimal WER computation sketch follows this list):

- [SASRBench-v1](https://huggingface.co/datasets/mjwong/SASRBench-v1): A benchmark dataset for evaluating ASR performance on Singlish.

- [AMI](https://huggingface.co/datasets/edinburghcstr/ami): A widely used dataset for meeting transcription and diarization tasks. This work specifically uses the IHM (Individual Headset Microphone) recordings.

- [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech): A large-scale open-source dataset with diverse English audio, covering read, conversational, and spontaneous speech.

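For reference, WER on a benchmark split can be computed with the `evaluate` library as sketched below; the exact evaluation script and any text normalization used for the numbers in the table may differ.

```python
import evaluate

wer_metric = evaluate.load("wer")

# `predictions` holds model transcripts, `references` the ground truth
# (toy examples shown here).
predictions = ["then he go makan already", "the meeting starts at ten"]
references = ["then he go makan already", "the meeting starts at ten am"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer:.2%}")
```
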
### Model Performance

Rel. RTFx reports throughput (inverse real-time factor) relative to the base Phi-4-multimodal-instruct on the same dataset; higher is faster. The best WER and Rel. RTFx per dataset are in bold.

| **Dataset** | **Model** | **Rel. RTFx** | **WER** |
|-----------------|-----------------------------------------------------------------------------------------------------------|---------------|------------|
| SASRBench-v1 | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | 1.00 | 33.00% |
| SASRBench-v1 | [mjwong/Phi-4-mm-inst-asr-singlish](https://huggingface.co/mjwong/Phi-4-mm-inst-asr-singlish) | 1.03 | **13.16%** |
| SASRBench-v1 | [mjwong/whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) | 2.60 | 16.41% |
| SASRBench-v1 | [mjwong/whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish) | **6.13** | 13.35% |
| SASRBench-v1 | mjwong/whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | 5.72 | 14.84% |
| | | | |
| AMI | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | 1.00 | **14.74%** |
| AMI | [mjwong/Phi-4-mm-inst-asr-singlish](https://huggingface.co/mjwong/Phi-4-mm-inst-asr-singlish) | 1.11 | 20.23% |
| AMI | [mjwong/whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) | 1.14 | 23.72% |
| AMI | [mjwong/whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish) | 1.75 | 16.99% |
| AMI | mjwong/whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | **2.59** | 22.06% |
| | | | |
| GigaSpeech | [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) | 1.00 | 24.65% |
| GigaSpeech | [mjwong/Phi-4-mm-inst-asr-singlish](https://huggingface.co/mjwong/Phi-4-mm-inst-asr-singlish) | 1.20 | **10.34%** |
| GigaSpeech | [mjwong/whisper-large-v3-singlish](https://huggingface.co/mjwong/whisper-large-v3-singlish) | 2.03 | 13.15% |
| GigaSpeech | [mjwong/whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish) | 3.97 | 11.54% |
| GigaSpeech | mjwong/whisper-large-v3-singlish + [DRAFT](https://huggingface.co/mjwong/whisper-large-v3-singlish-DRAFT) | **4.81** | 12.81% |

### Experimental Observations

#### Base vs. Fine-Tuned Behavior

**Base model:** Phi-4’s generalist design allowed instruction-based transcription but lacked a robust stopping criterion. When prompted to generate a fixed number of tokens, it often continued past the audio’s end, repeating or fabricating tokens until the `max_new_tokens` limit or an implicit end-of-sequence signal was reached.

**Fine-tuned model:** Because the end-of-transcription marker was appended to every training target, the model learned task-specific stopping. Even with a high `max_new_tokens` setting, it reliably generated `<|end|><|endoftext|>` immediately after completing the actual transcription, avoiding extraneous output.

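If you want to make that learned stopping explicit at inference time, one option is to resolve the marker tokens to ids and pass them to `generate` as additional end-of-sequence ids. This is an optional safeguard sketched under assumptions, not a required step; the fine-tuned model already stops on its own.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "mjwong/Phi-4-mm-inst-asr-singlish", trust_remote_code=True
)

# Resolve the end-of-transcription marker to token ids.
stop_ids = processor.tokenizer.convert_tokens_to_ids(["<|end|>", "<|endoftext|>"])

# These ids can then be supplied to generation, e.g.:
# model.generate(**inputs, max_new_tokens=1200, eos_token_id=stop_ids)
print(stop_ids)
```
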
#### Behavior on Long Audio Clips

The output length remains bounded by `max_new_tokens`, irrespective of input duration. For clips requiring fewer tokens than the limit, the fine-tuned model cleanly stops at the marker. For longer clips, it produces a truncated but well-formed transcription up to the token limit, without failing or crashing.

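For clips long enough that a single pass would hit the `max_new_tokens` ceiling, one simple workaround (not part of this model's training or evaluation setup) is to split the audio into shorter chunks, transcribe each chunk with the usage code below, and concatenate the partial transcripts:

```python
import soundfile

CHUNK_SECONDS = 20  # mirrors the 20-second training cap; an assumption, tune as needed

def chunk_audio(path: str, chunk_seconds: int = CHUNK_SECONDS):
    """Yield (samples, sampling_rate) tuples no longer than chunk_seconds each."""
    audio, sr = soundfile.read(path)
    step = int(chunk_seconds * sr)
    for start in range(0, len(audio), step):
        yield audio[start:start + step], sr

# Each yielded tuple can be passed as `audios=[chunk]` to the processor call
# shown in "How to use the model" below.
```
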
### Conclusion

Fine-tuning Phi-4-mm-inst cuts its Singlish WER from 33.00% to 13.16%, closing—and slightly beating—the gap to our best-performing fine-tuned [Whisper-large-v3-turbo-singlish](https://huggingface.co/mjwong/whisper-large-v3-turbo-singlish). While the absolute edge over Whisper is small, Phi-4’s real value is that it combines near–state-of-the-art ASR with a full generative LLM in one package. For Singlish speakers this means a single model that hears, understands, and responds natively, paving the way for voice-first agents that can reason, translate, or generate code without ever leaving the same context.

## Disclaimer

While this model has been fine-tuned to better recognize Singlish, users may experience inaccuracies, biases, or unexpected outputs, particularly in challenging audio conditions or with speakers using non-standard variations. Use of this model is at your own risk; the developers and distributors are not liable for any consequences arising from its use. Please validate results before deploying it in any sensitive or production environment.

## How to use the model

For first-time use, you may need to install the additional dependencies below. The commands assume a Jupyter/Colab-style notebook (hence the `!` shell escapes):

```python
# Notebook-style setup: the ! prefix runs shell commands in Jupyter/Colab.
!pip install backoff

# Build tools used if flash-attn has to be compiled from source.
!sudo apt-get install -y cmake ninja-build
!pip install wheel

from pkg_resources import get_distribution, DistributionNotFound

package_name = 'flash_attn'

# Install flash-attn only if it is not already present.
try:
    dist = get_distribution(package_name)
    print(f"'{package_name}' version {dist.version} is already installed.")
except DistributionNotFound:
    !MAX_JOBS=8 pip install flash-attn --no-build-isolation
```

The model can be loaded like so:

```python
import torch
import soundfile
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_path = "mjwong/Phi-4-mm-inst-asr-singlish"

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

generation_config = GenerationConfig.from_pretrained(model_path, 'generation_config.json')

# Prompt format expected by Phi-4-mm-inst: user turn, audio placeholder,
# instruction, end marker, then the assistant turn.
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

speech_prompt = "Based on the attached audio, generate a comprehensive text transcription of the spoken content."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
```

You can then transcribe audio clips of arbitrary length. As an illustration, the audio file `ignite.wav` can be downloaded from [this link](https://github.com/microsoft/PhiCookBook/blob/main/md/02.Application/05.Audio/Phi4/Transciption/ignite.wav).

```python
# soundfile.read returns a (samples, sampling_rate) tuple, which the
# processor accepts directly.
audio = soundfile.read('./ignite.wav')

inputs = processor(text=prompt, audios=[audio], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1200,
    generation_config=generation_config,
    num_logits_to_keep=1,
)

# Strip the prompt tokens so only the newly generated transcript remains.
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```

## Contact

For more information, please reach out to [email protected].