junnei/gemma-3-4b-it-speech

@@ -29,11 +29,6 @@ capabilities** through a Speech Adapter.
 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
-~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
-with Gemma-3-4b-it-speech being a specific instance of this family.
-These models maintain the original Gemma-3 capabilities while adding
-multilingual speech recognition and translation abilities.~~
 ## Evaluation
 Model evaluation metrics and results.
@@ -42,8 +37,10 @@ Here is [Script][Script] to evaluate model.
 [Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
 [Covost2]: https://huggingface.co/datasets/junnei/covost2
 [LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
 [Fleurs]: https://huggingface.co/datasets/google/fleurs
 [Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
 [Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
@@ -51,6 +48,9 @@ Here is [Script][Script] to evaluate model.
 [Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
 [Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
 [Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
 #### ASR
@@ -61,6 +61,15 @@ Here is [Script][Script] to evaluate model.
 | [LibriSpeech-Clean][LibriSpeech] | ASR (English)  |   **94.28**   |   **0.98**   |   **2.91**   | [Link][Link3] |
 | [LibriSpeech-Other][LibriSpeech] | ASR (English)  |   **87.60**   |   **3.10**   |   **6.55**   | [Link][Link4] |
 #### AST
@@ -89,9 +98,18 @@ Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-
 - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
-- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks with A100 1 GPU in 12 hours.
-- The training data was limited to **English and Korean languages** from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) within **less than 30 seconds in duration.**
 ## Limitations
@@ -127,7 +145,7 @@ from transformers import AutoProcessor, AutoModel
 import torch
 model_id = "junnei/gemma-3-4b-it-speech"
-revision = "main" # or "v1.0". cause model is being updated, main branch could be crashed.
 model = AutoModel.from_pretrained(
     model_id, device_map="auto", revision = revision, trust_remote_code=True

 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
 ## Evaluation
 Model evaluation metrics and results.
 [Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
 [Covost2]: https://huggingface.co/datasets/junnei/covost2
+[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
 [LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
 [Fleurs]: https://huggingface.co/datasets/google/fleurs
+[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean
 [Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
 [Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
 [Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
 [Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
 [Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
+[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
+[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
+[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json
 #### ASR
 | [LibriSpeech-Clean][LibriSpeech] | ASR (English)  |   **94.28**   |   **0.98**   |   **2.91**   | [Link][Link3] |
 | [LibriSpeech-Other][LibriSpeech] | ASR (English)  |   **87.60**   |   **3.10**   |   **6.55**   | [Link][Link4] |
+##### (Experimental) ASR in Korean Branch
+Scores are lower because no Korean text normalizer is applied.
+| Benchmark                        | Task           |     BLEU ↑    |     CER ↓    |     WER ↓    |     Result    |
+| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
+| [Zeroth][Zeroth]                 | ASR (Korean)   |   **94.91**   |   **1.31**   |   **2.50**   | [Link][Link7] |
+| [Fleurs][Fleurs]                 | ASR (Korean)   |   **62.83**   |   **9.08**   |   **23.0**   | [Link][Link8] |
+| [Covost2][Covost2]               | ASR (Korean)   |   **43.66**   |   **22.5**   |   **41.4**   | [Link][Link9] |
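To illustrate why a missing text normalizer lowers these scores, here is a minimal sketch of how WER and CER can be computed from edit distance. It is not the logic of the linked evaluation script: `simple_normalize` is a hypothetical stand-in for a real Korean normalizer, and ignoring spaces in CER is an assumption made for this example.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; spaces are ignored (an assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def simple_normalize(text):
    """Hypothetical normalizer: NFKC, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    kept = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.split())


reference = "안녕하세요, 반갑습니다."
hypothesis = "안녕하세요 반갑습니다"

raw_wer = wer(reference, hypothesis)
normalized_wer = wer(simple_normalize(reference), simple_normalize(hypothesis))
print(f"WER without normalization: {raw_wer:.2f}")
print(f"WER with normalization:    {normalized_wer:.2f}")
```

In this example the hypothesis differs from the reference only in punctuation, yet every word counts as an error until both strings are normalized.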
 #### AST
 - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
+- Due to limited computational resources, the model was **trained on a limited set of datasets and for a limited number of epochs** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU.
+- The training data was limited to **English and Korean** audio clips **shorter than 30 seconds.**
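The 30-second limit on training clips could be enforced with a filter like the sketch below. This is an illustration, not the author's preprocessing code: the sample layout (`sample["audio"]["array"]` and `sample["audio"]["sampling_rate"]`) is assumed to follow the decoded audio format commonly used by Hugging Face datasets, and `keep_short` is a hypothetical helper.

```python
MAX_SECONDS = 30.0  # duration limit stated in the model card


def duration_seconds(sample):
    """Clip length in seconds: number of samples divided by the sampling rate."""
    audio = sample["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def keep_short(samples, max_seconds=MAX_SECONDS):
    """Return only the samples strictly shorter than max_seconds."""
    return [s for s in samples if duration_seconds(s) < max_seconds]


# Two synthetic clips at 16 kHz: one 2 seconds long, one 31 seconds long.
short_clip = {"audio": {"array": [0.0] * (16000 * 2), "sampling_rate": 16000}}
long_clip = {"audio": {"array": [0.0] * (16000 * 31), "sampling_rate": 16000}}

kept = keep_short([short_clip, long_clip])
print(f"kept {len(kept)} of 2 clips")
```

With a real `datasets.Dataset`, the same predicate could be passed to `Dataset.filter`, assuming the audio column is decoded into this layout.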
+## Datasets
+### ASR / AST
+- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
+- [LibriSpeech][LibriSpeech]
+- [Fleurs][Fleurs]
+- [Zeroth][Zeroth]
 ## Limitations
 import torch
 model_id = "junnei/gemma-3-4b-it-speech"
+revision = "main"  # or "korean" for the experimental Korean ASR branch
 model = AutoModel.from_pretrained(
     model_id, device_map="auto", revision = revision, trust_remote_code=True