---
library_name: transformers
datasets:
- Bingsu/zeroth-korean
- google/fleurs
language:
- ko
metrics:
- cer
- wer
- bleu
base_model:
- microsoft/Phi-4-multimodal-instruct
model-index:
- name: Phi-4-multimodal-instruct-ko-asr
  results:
  - task:
      type: automatic-speech-recognition
    dataset:
      type: Bingsu/zeroth-korean
      name: zeroth-korean-test
    metrics:
    - type: bleu
      name: zeroth-test-BLEU
      value: 94.837
    - type: cer
      name: zeroth-test-CER
      value: 1.316
    - type: wer
      name: zeroth-test-WER
      value: 2.951
  - task:
      type: automatic-speech-recognition
    dataset:
      type: google/fleurs
      name: fleurs-ko-test
    metrics:
    - type: bleu
      name: fleurs-test-BLEU
      value: 67.659
    - type: cer
      name: fleurs-test-CER
      value: 7.951
    - type: wer
      name: fleurs-test-WER
      value: 18.313
pipeline_tag: automatic-speech-recognition
---

This model is fine-tuned from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on [Bingsu/zeroth-korean](https://huggingface.co/datasets/Bingsu/zeroth-korean) and [google/fleurs](https://huggingface.co/datasets/google/fleurs) for 5 epochs.

The model was trained for 960 steps on these datasets for Korean automatic speech recognition (ASR) on an H100 GPU.

After that, we continued training with the [CoVoST2 dataset][Covost2] / [CoVoST2-Ko][Covost2-ko] for AST (automatic speech translation).

[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2-ko
[ASR]: https://huggingface.co/junnei/Phi-4-multimodal-instruct-ko-asr
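
## Usage

A minimal ASR inference sketch, assuming the same loading and prompt format as the base Phi-4-multimodal-instruct model; the audio file path and generation settings below are placeholders:

```python
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

model_id = "junnei/Phi-4-multimodal-instruct-ko-asr"

# The Phi-4-multimodal checkpoints ship custom model/processor code.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda", trust_remote_code=True
)
generation_config = GenerationConfig.from_pretrained(model_id)

# Phi-4-multimodal chat format: the audio placeholder token goes inside the user turn.
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"

# "sample_ko.wav" is a placeholder for any Korean speech clip.
audio, sr = sf.read("sample_ko.wav")
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)

generate_ids = model.generate(
    **inputs, max_new_tokens=256, generation_config=generation_config
)
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```

For AST, the same pattern applies with a translation instruction (e.g. "Translate the audio to English.") in the user turn instead of the transcription prompt.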

## Evaluation

Evaluation was done on the following datasets:

- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-korean test set (457 samples).
- AST (Automatic Speech Translation): evaluated with BLEU score on FLEURS Korean <-> English speech translation results (270 samples).

The evaluation script is taken from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
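
The linked script defines the exact prompts and text normalization; as a rough sketch, the reported CER/WER/BLEU can be computed with the Hugging Face `evaluate` package (the reference/prediction strings below are illustrative placeholders):

```python
# pip install evaluate jiwer sacrebleu
import evaluate

cer_metric = evaluate.load("cer")
wer_metric = evaluate.load("wer")
bleu_metric = evaluate.load("sacrebleu")

# Placeholders: in practice these come from the zeroth-korean / FLEURS test sets
# and the model's generated transcripts / translations.
references = ["어제 서울 날씨는 맑았습니다"]
predictions = ["어제 서울 날씨는 맑았습니다"]

print("CER:", 100 * cer_metric.compute(predictions=predictions, references=references))
print("WER:", 100 * wer_metric.compute(predictions=predictions, references=references))
print("BLEU:", bleu_metric.compute(
    predictions=predictions, references=[[r] for r in references]
)["score"])
```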

Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor) and [Phi-4-multimodal-finetune-ko-speech](https://huggingface.co/daekeun-ml/Phi-4-multimodal-finetune-ko-speech), ASR performance is significantly improved.

| Model                                           | zeroth-CER | zeroth-WER | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
|-------------------------------------------------|------------|------------|--------------|------------------|--------------|------------------|
| original                                        | 198.32     | -          | 5.63         | 2.42             | 6.86         | 4.17             |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech  | 1.61       | 3.54       | 7.67         | 8.38             | 12.31        | 9.69             |
| seastar105/Phi-4-mm-inst-zeroth-kor             | 7.02       | -          | 7.07         | 9.19             | 13.08        | 9.35             |
| [**ASR finetune**][ASR]                         | **1.31**   | 2.95       | 7.46         | 6.24             | 12.15        | 8.91             |
| + 1 epoch finetune with [Covost-Ko][Covost2-ko] | 3.88       | -          | **8.07**     | **10.09**        | **18.82**    | **15.41**        |
| **AST finetuned model (this model)**            | **1.77**   | **2.99**   | **8.01**     | **9.09**         | **17.09**    | **11.82**        |