---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---

# Gemma 3 MM model card


**Terms of Use**: [Terms][terms]

[terms]: https://ai.google.dev/gemma/terms

## Model Summary

**Gemma-3-MM** is a family of open multimodal instruction models that extends the 
capabilities of the original Gemma-3 models to **include speech processing.**

These models leverage the language and vision research used in the 
original Gemma-3 models and incorporate **additional speech processing 
capabilities** through a Speech Adapter. 

The models process text, image, and audio inputs and generate text outputs, with a 128K-token context length (32K for the 1B model).

## Evaluation

Model evaluation metrics and results.

Here is the [script][Script] used to evaluate the model.
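
The reported BLEU, CER, and WER numbers come from that script. As a rough, hedged sketch (not the linked script itself), these metrics can be computed from model transcripts and reference texts with `jiwer` and `sacrebleu`:

```python
# Hypothetical sketch of metric computation; the linked evaluation script is
# the source of truth and may normalize text differently.
from jiwer import wer, cer      # word / character error rate
import sacrebleu                # corpus-level BLEU

references = ["transcribe this audio clip into text"]
hypotheses = ["transcribe this audio clip in to text"]

print("WER :", 100 * wer(references, hypotheses))
print("CER :", 100 * cer(references, hypotheses))
print("BLEU:", sacrebleu.corpus_bleu(hypotheses, [references]).score)
```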

[Korean Branch]: https://huggingface.co/junnei/gemma-3-4b-it-speech/tree/korean
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
[Fleurs]: https://huggingface.co/datasets/google/fleurs
[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean

[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json

### ASR

| Benchmark                        | Task           |     BLEU ↑    |     CER ↓    |     WER ↓    |     Result    |
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
| [Covost2][Covost2]               | ASR (English)  |   **86.09**   |   **4.12**   |   **7.83**   | [Link][Link1] |
| [Fleurs][Fleurs]                 | ASR (English)  |   **89.61**   |   **2.28**   |   **5.23**   | [Link][Link2] |
| [LibriSpeech-Clean][LibriSpeech] | ASR (English)  |   **94.28**   |   **0.98**   |   **2.91**   | [Link][Link3] |
| [LibriSpeech-Other][LibriSpeech] | ASR (English)  |   **87.60**   |   **3.10**   |   **6.55**   | [Link][Link4] |

### AST

| Benchmark                      | Task                          |     BLEU ↑    |     Result    |
| ------------------------------ |-------------------------------|:-------------:|:-------------:|
| [Covost2][Covost2]             | AST (0-shot, English-Korean)  |     31.55     | [Link][Link5] |
| [Fleurs][Fleurs]               | AST (0-shot, English-Korean)  |     11.05     | [Link][Link6] |

#### (Experimental) ASR: [Korean Branch][Korean Branch]

Scores are lower because a Korean text normalizer is not applied; a brief normalization sketch follows the table below.

| Benchmark                        | Task           |     BLEU ↑    |     CER ↓    |     WER ↓    |     Result    |
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
| [Zeroth][Zeroth]                 | ASR (Korean)   |   **94.91**   |   **1.31**   |   **2.50**   | [Link][Link7] |
| [Fleurs][Fleurs]                 | ASR (Korean)   |   **62.83**   |   **9.08**   |   **23.0**   | [Link][Link8] |
| [Covost2][Covost2]               | ASR (Korean)   |   **43.66**   |   **22.5**   |   **41.4**   | [Link][Link9] |
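
As a purely illustrative sketch (not part of the evaluation script), applying even a simple normalizer to both reference and hypothesis before scoring removes punctuation mismatches that would otherwise count as errors:

```python
import re
from jiwer import wer

def simple_normalize(text: str) -> str:
    # Illustrative normalizer: lowercase, strip punctuation, collapse spaces.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

ref = "안녕하세요, 오늘 날씨가 좋네요!"
hyp = "안녕하세요 오늘 날씨가 좋네요"

print("raw WER       :", wer(ref, hyp))                                      # punctuation counted as errors
print("normalized WER:", wer(simple_normalize(ref), simple_normalize(hyp)))  # 0.0
```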

## Model Details

[junnei]: https://huggingface.co/junnei

- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)

## Training Details

- The model was trained by adding a **596M-parameter Speech LoRA adapter** to the base Gemma-3-4b-it model; a minimal, hypothetical LoRA configuration sketch follows this list.

- Due to limited computational resources, the model was **trained on only limited datasets and epochs** for ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU.

- The training data was limited to **English and Korean** audio clips **shorter than 30 seconds in duration.**
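
For readers unfamiliar with LoRA, here is a minimal, hypothetical sketch of attaching a LoRA adapter to the base model with `peft`. The actual speech adapter, target modules, and hyperparameters are defined in the finetuning script linked in the Usage section; the values below are illustrative assumptions only.

```python
# Hypothetical LoRA setup sketch; not the actual Gemma-3-MM training code.
from peft import LoraConfig, get_peft_model
from transformers import Gemma3ForConditionalGeneration

base = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=16,                                 # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```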

## Datasets

### ASR / AST

- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
- [LibriSpeech][LibriSpeech]
- [Fleurs][Fleurs]
- [Zeroth][Zeroth]

## Limitations

Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. 
To improve the model's performance and reliability, the following areas need further development:

- More computational resources are needed for extended training.

- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**

- Due to limited computing resources, the model **primarily handles audio files shorter than 30 seconds** in duration; accuracy may drop significantly for longer audio inputs.

- If resources allow, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.

### Usage

Below are some code snippets to help you get started with running the model quickly.

First, upgrade your Transformers library; audio input for chat templates is now supported.

```sh
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running the model with chat_template

```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main" # or "korean".

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)

# Expected output: What is shown in this image?
```
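
Since the model also supports Vision-Language tasks, an image can be passed through the same chat template. This is a hedged sketch that reuses the `model` and `processor` loaded above and assumes the custom processor follows the standard Gemma-3 image content format:

```python
# Hypothetical vision example; assumes the processor accepts the standard
# Gemma-3 chat-template image content type.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```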


#### Running the model with raw data

```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch

# Reuses the `model` and `processor` loaded in the previous snippet.

# Get the audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
audio_token = '<start_of_audio>'


messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)


inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
    response = processor.batch_decode(
        generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]
print(response)
```

### Finetune the model
[Finetune]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/finetune_speech.py

Here is the finetuning script: [Link][Finetune]

**You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.**

```bash
python finetune_speech.py
```



### Citation

```bibtex
@article{gemma3mm_2025,
    title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
    author={Seongjun Jang},
    year={2025}
}
```