Phi-4-multimodal-finetune-ko-speech

This model was fine-tuned from microsoft/Phi-4-multimodal-instruct for Korean automatic speech recognition (ASR) and speech translation, using the following datasets:

  • kresnik/zeroth_korean
  • mozilla-foundation/common_voice_17_0 (Korean speech only)
  • PolyAI/minds14 (Korean speech only)
  • A custom dataset of my own recordings: a mix of fast and slow speech (technical blog posts and presentations I have published), with some modulation applied using audiomentations and this script (a minimal augmentation sketch follows this list)

In total there are about 35K samples; each sample is a pair of Korean speech and its transcription. All audio was resampled to 16 kHz.
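
The exact augmentation configuration is not documented here; the sketch below only illustrates the kind of 16 kHz resampling and audiomentations-based modulation described above, with illustrative (assumed) transforms and parameters.

import librosa
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain; the probabilities and ranges here are
# assumptions, not the exact settings used to build the training set.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

def augment_clip(in_path, out_path, sr=16000):
    # Load and resample to 16 kHz, apply the augmentation chain, and save.
    audio, _ = librosa.load(in_path, sr=sr)
    augmented = augment(samples=audio, sample_rate=sr)
    sf.write(out_path, augmented, sr)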

The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the sample_finetune_speech.py script from microsoft/Phi-4-multimodal-instruct.

The latest uploaded version was fine-tuned with the audio encoder unfrozen, which significantly improves ASR performance over the baseline LoRA adapter-based fine-tuning: on the zeroth test set, full fine-tuning vs. LoRA fine-tuning yields a CER of 1.61% vs. 2.72% and a WER of 3.54% vs. 7.19%.
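
For reference, unfreezing the audio encoder amounts to re-enabling gradients on the audio-tower parameters before training. Below is a minimal sketch; the "audio" name filter is an assumption about how the parameters are named, not the exact code used for this model.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)

# Unfreeze everything that belongs to the audio tower so the encoder is trained
# together with the speech adapter; matching "audio" in parameter names is an
# assumed heuristic for this checkpoint.
for name, param in model.named_parameters():
    if "audio" in name:
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")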

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

The Phi-4-multimodal model is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

Evaluation

Evaluation was done on the following datasets:

  • ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth test set (457 samples).
  • AST (Automatic Speech Translation): evaluated with BLEU score on the fleurs ko <-> en speech translation test set (270 samples).

The evaluation script is retrieved from here.
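
The metrics themselves are standard. Below is a minimal sketch of how CER/WER and BLEU can be computed with jiwer and sacrebleu; the library choice is an assumption, not necessarily what the linked script uses.

import jiwer
import sacrebleu

# Toy references/hypotheses just to show the metric calls; the actual
# evaluation runs over the zeroth test set (ASR) and fleurs test set (AST).
asr_refs = ["안녕하세요 반갑습니다"]
asr_hyps = ["안녕하세요 반갑습니다"]
print("CER:", jiwer.cer(asr_refs, asr_hyps))
print("WER:", jiwer.wer(asr_refs, asr_hyps))

ast_refs = ["Hello, nice to meet you."]
ast_hyps = ["Hello, nice to meet you."]
print("BLEU:", sacrebleu.corpus_bleu(ast_hyps, [ast_refs]).score)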

Compared to Phi-4-mm-inst-zeroth-kor, ASR is significantly improved thanks to more high-quality voice data, including my own recordings. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in to mitigate catastrophic forgetting.

| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|---|---|---|---|---|---|---|
| original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |

Usage

Requirements

The model works with the following package versions. Please make sure they are installed before using the model.

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
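
The pinned versions above can be installed with pip, for example by saving the list as a requirements.txt and running pip install -r requirements.txt; note that flash_attn requires a CUDA-capable environment to build.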

Sample code

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(ft_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] 
print(response) # "몬터규는 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
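
The AST prompts defined above work the same way. Below is a minimal sketch that reuses the same clip with the CoT Korean-to-English prompt, where the output follows the "<transcript> <sep> <translation>" format requested by the prompt.

# AST (CoT): transcribe, then translate to English, separated by <sep>
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "<Korean transcript> <sep> <English translation>"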

Demos

Please refer to the Jupyter notebook and video clips in the demo folder. The results are not production quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
