---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: Bingsu/zeroth-korean
      name: zeroth-korean-test
    metrics:
    - type: bleu
      name: zeroth-test-BLEU
      value: 94.837
    - type: cer
      name: zeroth-test-CER
      value: 1.316
    - type: wer
      name: zeroth-test-WER
      value: 2.951
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: fleurs-ko-test
    metrics:
    - type: bleu
      name: fleurs-test-BLEU
      value: 67.659
    - type: cer
      name: fleurs-test-CER
      value: 7.951
    - type: wer
      name: fleurs-test-WER
      value: 18.313
pipeline_tag: automatic-speech-recognition
---


This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean) and [google/fleurs](https://huggingface.co/datasets/google/fleurs) for 5 epochs.

The model was trained for 960 steps on these datasets for Korean automatic speech recognition (ASR) on an H100 GPU.

After that, we continued training on the [CoVoST2 Dataset][Covost2] and [CoVoST2-Ko][Covost2-ko] for automatic speech translation (AST).

The AST fine-tuned model is available here: [Phi-4-multimodal-instruct-ko-speech][Speech]

[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2-ko
[Speech]: https://huggingface.co/junnei/Phi-4-multimodal-instruct-ko-speech

## Evaluation

Evaluation was done on the following datasets:
- ASR (Automatic Speech Recognition): Evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth-test set (457 samples).
- AST (Automatic Speech Translation): Evaluated with BLEU score on fleurs ko <-> en speech translation results (270 samples).

The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
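
For readers unfamiliar with the ASR metrics, the sketch below shows how CER and WER are commonly computed from a Levenshtein edit distance. It is an illustration only, not the linked evaluation script: the function names `edit_distance`, `cer`, and `wer` are hypothetical, and the handling of whitespace and text normalization here (spaces counted as characters, no punctuation stripping) is an assumption that may differ from what the script does.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate in percent (assumption: spaces count as characters)."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate in percent, splitting on whitespace."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    ref = "안녕하세요 반갑습니다"
    hyp = "안녕하세요 반갑습니당"
    print(f"CER: {cer(ref, hyp):.2f}")
    print(f"WER: {wer(ref, hyp):.2f}")
```

Because insertions count as errors and the denominator is the reference length, both rates can exceed 100 when a model outputs far more text than the reference contains.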

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), this model improves ASR accuracy, lowering the zeroth-test CER to 1.31 (from 7.02 and 1.61, respectively).

| Model                                          | zeroth-CER  | zeroth-WER | fleurs-ko_en-BLEU | fleurs-ko_en-cot-BLEU | fleurs-en_ko-BLEU | fleurs-en_ko-cot-BLEU |
|------------------------------------------------|-------------|------------|-------------------|-----------------------|-------------------|-----------------------|
| original                                       |  198.32     |     -      |       5.63        |         2.42          |       6.86        |         4.17          |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech |  1.61       |    3.54    |       7.67        |         8.38          |       12.31       |         9.69          |
| seastar105/Phi-4-mm-inst-zeroth-kor            |  7.02       |     -      |       7.07        |         9.19          |       13.08       |         9.35          |
| **ASR finetune(this model)**                   |  **1.31**   |    2.95    |       7.46        |         6.24          |       12.15       |         8.91          |
| + 1 epoch finetune with [Covost-Ko][Covost2-ko]|  3.88       |     -      |     **8.07**      |       **10.09**       |     **18.82**     |      **15.41**        |
| [**AST finetuned model**][Speech]              |  **1.77**   |  **2.99**  |     **8.01**      |       **9.09**        |     **17.09**     |      **11.82**        |