Phi-4-multimodal-finetune-ko-speech

This model was fine-tuned from microsoft/Phi-4-multimodal-instruct for Korean automatic speech recognition (ASR) and speech translation, using the following datasets:

  • kresnik/zeroth_korean
  • mozilla-foundation/common_voice_17_0 (Korean speech only)
  • PolyAI/minds14 (Korean speech only)
  • A custom dataset of my own recordings: a mix of fast and slow speech (technical blog posts and presentations I have published), with some modulation applied using audiomentations and this script (a minimal augmentation sketch follows this list)

In total there are about 35K samples; each sample is a pair of Korean speech and its transcription. All audio was resampled to 16 kHz.
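
The exact augmentation configuration is not documented here; the sketch below only illustrates the kind of 16 kHz resampling and audiomentations-based modulation described above, with illustrative (assumed) transforms and parameters.

import librosa
import soundfile as sf
from audiomentations import AddGaussianNoise, Compose, PitchShift, TimeStretch

# Illustrative augmentation chain; the probabilities and ranges here are
# assumptions, not the exact settings used to build the training set.
augment = Compose([
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.5),
])

def augment_clip(in_path, out_path, sr=16000):
    # Load and resample to 16 kHz, apply the augmentation chain, and save.
    audio, _ = librosa.load(in_path, sr=sr)
    augmented = augment(samples=audio, sample_rate=sr)
    sf.write(out_path, augmented, sr)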

The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the sample_finetune_speech.py script from microsoft/Phi-4-multimodal-instruct.

The latest uploaded version was fine-tuned with the audio encoder unfrozen, which significantly improves ASR performance over the baseline LoRA adapter-based fine-tuning: on the zeroth test set, full fine-tuning vs. LoRA fine-tuning yields a CER of 1.61% vs. 2.72% and a WER of 3.54% vs. 7.19%.
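
For reference, unfreezing the audio encoder amounts to re-enabling gradients on the audio-tower parameters before training. Below is a minimal sketch; the "audio" name filter is an assumption about how the parameters are named, not the exact code used for this model.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    trust_remote_code=True,
    torch_dtype="auto",
)

# Unfreeze everything that belongs to the audio tower so the encoder is trained
# together with the speech adapter; matching "audio" in parameter names is an
# assumed heuristic for this checkpoint.
for name, param in model.named_parameters():
    if "audio" in name:
        param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")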

Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.

The Phi-4-multimodal model is strong at multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.

Evaluation

Evaluation was done on the following datasets:

  • ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth test set (457 samples).
  • AST (Automatic Speech Translation): evaluated with BLEU score on the fleurs ko <-> en speech translation test set (270 samples).

The evaluation script is retrieved from here.
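
The metrics themselves are standard. Below is a minimal sketch of how CER/WER and BLEU can be computed with jiwer and sacrebleu; the library choice is an assumption, not necessarily what the linked script uses.

import jiwer
import sacrebleu

# Toy references/hypotheses just to show the metric calls; the actual
# evaluation runs over the zeroth test set (ASR) and fleurs test set (AST).
asr_refs = ["안녕하세요 반갑습니다"]
asr_hyps = ["안녕하세요 반갑습니다"]
print("CER:", jiwer.cer(asr_refs, asr_hyps))
print("WER:", jiwer.wer(asr_refs, asr_hyps))

ast_refs = ["Hello, nice to meet you."]
ast_hyps = ["Hello, nice to meet you."]
print("BLEU:", sacrebleu.corpus_bleu(ast_hyps, [ast_refs]).score)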

Compared to Phi-4-mm-inst-zeroth-kor, ASR is significantly improved thanks to more high-quality voice data, including my own recordings. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in to mitigate catastrophic forgetting.

| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en (BLEU) | fleurs-ko2en-cot (BLEU) | fleurs-en2ko (BLEU) | fleurs-en2ko-cot (BLEU) |
|---|---|---|---|---|---|---|
| original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |

Usage

Requirements

The model works with the following package versions. Please make sure they are installed before using the model.

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.4.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.14.0
datasets==3.3.2
librosa==0.10.2.post1
pandas==2.2.3
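
The pinned versions above can be installed with pip, for example by saving the list as a requirements.txt and running pip install -r requirements.txt; note that flash_attn requires a CUDA-capable environment to build.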

Sample code

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

max_new_tokens = 256
ft_model_path = "daekeun-ml/Phi-4-multimodal-finetune-ko-speech"
generation_config = GenerationConfig.from_pretrained(ft_model_path, 'generation_config.json')
processor = AutoProcessor.from_pretrained(ft_model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    ft_model_path,
    trust_remote_code=True,
    torch_dtype='auto',
    _attn_implementation='flash_attention_2',
).cuda()

user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# task prompt is from technical report
asr_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio clip into text.{prompt_suffix}{assistant_prompt}'
ast_ko_prompt = f'{user_prompt}<|audio_1|>Translate the audio to Korean.{prompt_suffix}{assistant_prompt}'
ast_cot_ko_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to Korean. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'
ast_en_prompt = f'{user_prompt}<|audio_1|>Translate the audio to English.{prompt_suffix}{assistant_prompt}'
ast_cot_en_prompt = f'{user_prompt}<|audio_1|>Transcribe the audio to text, and then translate the audio to English. Use <sep> as a separator between the original transcript and the translation.{prompt_suffix}{assistant_prompt}'

asr_ds = load_dataset("kresnik/zeroth_korean", split="test")

# ASR
item = asr_ds[0]
audio = (item["audio"]["array"], item["audio"]["sampling_rate"])
inputs = processor(text=asr_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0] 
print(response) # "몬터규는 자녀들이 사랑을 제대로 못 받고 크면 매우 심각한 결과가 초래된다는 결론을 내렸습니다"
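
The AST prompts defined above work the same way. Below is a minimal sketch that reuses the same clip with the CoT Korean-to-English prompt, where the output follows the "<transcript> <sep> <translation>" format requested by the prompt.

# AST (CoT): transcribe, then translate to English, separated by <sep>
inputs = processor(text=ast_cot_en_prompt, audios=[audio], return_tensors='pt').to(model.device)
generate_ids = model.generate(
    **inputs,
    max_new_tokens=max_new_tokens,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)  # "<Korean transcript> <sep> <English translation>"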

Demos

Please refer to the Jupyter notebook and video clips in the demo folder. The results are not production quality, since the model was fine-tuned only for PoC purposes, but you can see that it transcribes and translates with high accuracy even when a native speaker speaks quite quickly.
