junnei committed on
Commit
1da5fe6
·
verified ·
1 Parent(s): da4e0a4

Update README.md

Files changed (1)
  1. README.md +26 -8
README.md CHANGED
@@ -29,11 +29,6 @@ capabilities** through a Speech Adapter.
29
 
30
  The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
31
 
32
- ~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
33
- with Gemma-3-4b-it-speech being a specific instance of this family.
34
- These models maintain the original Gemma-3 capabilities while adding
35
- multilingual speech recognition and translation abilities.~~
36
-
37
  ## Evaluation
38
 
39
  Model evaluation metrics and results.
@@ -42,8 +37,10 @@ Here is [Script][Script] to evaluate model.
42
 
43
  [Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
44
  [Covost2]: https://huggingface.co/datasets/junnei/covost2
45
  [LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
46
  [Fleurs]: https://huggingface.co/datasets/google/fleurs
47
 
48
  [Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
49
  [Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
@@ -51,6 +48,9 @@ Here is [Script][Script] to evaluate model.
51
  [Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
52
  [Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
53
  [Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
54
 
55
  #### ASR
56
 
@@ -61,6 +61,15 @@ Here is [Script][Script] to evaluate model.
61
  | [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
62
  | [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |
63
 
64
 
65
  #### AST
66
 
@@ -89,9 +98,18 @@ Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-
89
 
90
  - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
91
 
92
- - Due to limited computational resources, the model was **trained for only 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU in 12 hours.
93
 
94
- - The training data was limited to **English and Korean** clips from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2) that are **less than 30 seconds long.**
95
 
96
  ## Limitations
97
 
@@ -127,7 +145,7 @@ from transformers import AutoProcessor, AutoModel
127
  import torch
128
 
129
  model_id = "junnei/gemma-3-4b-it-speech"
130
- revision = "main" # or "v1.0"; the model is still being updated, so the main branch may be broken.
131
 
132
  model = AutoModel.from_pretrained(
133
  model_id, device_map="auto", revision = revision, trust_remote_code=True
 
29
 
30
  The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
31
 
32
  ## Evaluation
33
 
34
  Model evaluation metrics and results.
 
37
 
38
  [Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
39
  [Covost2]: https://huggingface.co/datasets/junnei/covost2
40
+ [Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
41
  [LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
42
  [Fleurs]: https://huggingface.co/datasets/google/fleurs
43
+ [Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean
44
 
45
  [Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
46
  [Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
 
48
  [Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
49
  [Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
50
  [Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
51
+ [Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
52
+ [Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
53
+ [Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json
54
 
55
  #### ASR
56
 
 
61
  | [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
62
  | [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |
63
 
64
+ ##### (Experimental) ASR in Korean Branch
65
+
66
+ Scores are worse than they would be with normalization, because no Korean text normalizer is applied.
67
+
68
+ | Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
69
+ | -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
70
+ | [Zeroth][Zeroth] | ASR (Korean) | **94.91** | **1.31** | **2.50** | [Link][Link7] |
71
+ | [Fleurs][Fleurs] | ASR (Korean) | **62.83** | **9.08** | **23.0** | [Link][Link8] |
72
+ | [Covost2][Covost2] | ASR (Korean) | **43.66** | **22.5** | **41.4** | [Link][Link9] |
73
 
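The note above the Korean table attributes the weaker scores to the missing normalizer. A minimal sketch of why that matters follows, assuming WER and CER are edit-distance rates over words and over characters (spaces removed for CER). The `simple_normalize` helper is hypothetical and only strips punctuation and case. It is not the Korean normalizer the README refers to, and this is not the code in `evaluate_speech.py`.

```python
import re
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate as a fraction (multiply by 100 for a percentage)."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def cer(ref, hyp):
    """Character error rate as a fraction; spaces are ignored (an assumption)."""
    ref_chars = ref.replace(" ", "")
    return edit_distance(ref_chars, hyp.replace(" ", "")) / max(len(ref_chars), 1)


def simple_normalize(text):
    """Hypothetical normalizer: NFC, lowercase, drop punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)  # \w is Unicode-aware, so Hangul is kept
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    reference = "안녕하세요, 반갑습니다."
    hypothesis = "안녕하세요 반갑습니다"
    print("raw WER:", wer(reference, hypothesis))
    print("raw CER:", cer(reference, hypothesis))
    print("normalized WER:", wer(simple_normalize(reference), simple_normalize(hypothesis)))
    print("normalized CER:", cer(simple_normalize(reference), simple_normalize(hypothesis)))
```

In this toy pair the hypothesis differs from the reference only in punctuation, so every word counts as an error until both strings are normalized.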
74
  #### AST
75
 
 
98
 
99
  - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
100
 
101
+ - Due to limited computational resources, the model was **trained on only a limited set of datasets and for a limited number of epochs** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU.
102
+
103
+ - The training data was limited to **English and Korean** audio clips **less than 30 seconds long.**
104
+
105
+ ## Datasets
106
+
107
+ ### ASR / AST
108
 
109
+ - [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
110
+ - [LibriSpeech][LibriSpeech]
111
+ - [Fleurs][Fleurs]
112
+ - [Zeroth][Zeroth]
113
 
114
  ## Limitations
115
 
 
145
  import torch
146
 
147
  model_id = "junnei/gemma-3-4b-it-speech"
148
+ revision = "main" # or "korean".
149
 
150
  model = AutoModel.from_pretrained(
151
  model_id, device_map="auto", revision = revision, trust_remote_code=True