Update README.md
README.md
CHANGED
@@ -29,11 +29,6 @@ capabilities** through a Speech Adapter.

 The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).

-~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
-with Gemma-3-4b-it-speech being a specific instance of this family.
-These models maintain the original Gemma-3 capabilities while adding
-multilingual speech recognition and translation abilities.~~
-
 ## Evaluation

 Model evaluation metrics and results.
@@ -42,8 +37,10 @@ Here is [Script][Script] to evaluate model.

 [Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
 [Covost2]: https://huggingface.co/datasets/junnei/covost2
+[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
 [LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
 [Fleurs]: https://huggingface.co/datasets/google/fleurs
+[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean

 [Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
 [Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
@@ -51,6 +48,9 @@ Here is [Script][Script] to evaluate model.
 [Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
 [Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
 [Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
+[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
+[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
+[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json

 #### ASR

@@ -61,6 +61,15 @@ Here is [Script][Script] to evaluate model.
 | [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
 | [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |

+##### (Experimental) ASR in Korean Branch
+
+Score is lower because Korean Normalizer is not applied
+
+| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
+| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
+| [Zeroth][Zeroth] | ASR (Korean) | **94.91** | **1.31** | **2.50** | [Link][Link7] |
+| [Fleurs][Fleurs] | ASR (Korean) | **62.83** | **9.08** | **23.0** | [Link][Link8] |
+| [Covost2][Covost2] | ASR (Korean) | **43.66** | **22.5** | **41.4** | [Link][Link9] |

 #### AST

@@ -89,9 +98,18 @@ Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-

 - The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.

-- Due to limited computational resources, the model was **only trained for
+- Due to limited computational resources, the model was **only trained for limited datasets and epochs** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks with A100 1 GPU.
+
+- The training data was limited to **English and Korean languages** within **less than 30 seconds in duration.**
+
+## Datasets
+
+### ASR / AST

--
+- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
+- [LibriSpeech][LibriSpeech]
+- [Fleurs][Fleurs]
+- [Zeroth][Zeroth]

 ## Limitations

@@ -127,7 +145,7 @@ from transformers import AutoProcessor, AutoModel
 import torch

 model_id = "junnei/gemma-3-4b-it-speech"
-revision = "main" # or "
+revision = "main" # or "korean".

 model = AutoModel.from_pretrained(
     model_id, device_map="auto", revision = revision, trust_remote_code=True