---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
**Terms of Use**: [Terms][terms]
[terms]: https://ai.google.dev/gemma/terms
## Model Summary
**Gemma-3-MM** is an open family of multimodal instruction models that extends the
capabilities of the original Gemma-3 models to **include speech processing.**
These models leverage the language and vision research used in the
original Gemma-3 models and incorporate **additional speech processing
capabilities** through a Speech Adapter.
The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
## Evaluation
Model evaluation metrics and results.
Here is the [evaluation script][Script]; a minimal sketch of the metric computation is shown below.
[Korean Branch]: https://huggingface.co/junnei/gemma-3-4b-it-speech/tree/korean
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
[Covost2]: https://huggingface.co/datasets/junnei/covost2
[Covost2-ko]: https://huggingface.co/datasets/junnei/covost2
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
[Fleurs]: https://huggingface.co/datasets/google/fleurs
[Zeroth]: https://huggingface.co/datasets/Bingsu/zeroth-korean
[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
[Link7]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Zeroth_ko_kr_to_en_us.json
[Link8]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_Fleurs_ko_kr_to_en_us.json
[Link9]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/korean/asr_CoVoST_ko_kr_to_en_us.json
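For reference, here is a minimal sketch of how the reported BLEU, CER, and WER scores can be computed with the `evaluate` library. The prediction/reference pair is a made-up example, and the linked script may apply additional text normalization.
```python
# Minimal sketch: compute the metrics reported below with the `evaluate` library.
# The prediction/reference pair is a made-up example, not data from the benchmarks.
import evaluate

predictions = ["the cat sat on the mat"]
references = ["the cat sat on a mat"]

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
# sacreBLEU expects one list of references per prediction
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)["score"]

print(f"WER: {wer:.3f}  CER: {cer:.3f}  BLEU: {bleu:.2f}")
```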
### ASR
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
| [Covost2][Covost2] | ASR (English) | **86.09** | **4.12** | **7.83** | [Link][Link1] |
| [Fleurs][Fleurs] | ASR (English) | **89.61** | **2.28** | **5.23** | [Link][Link2] |
| [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
| [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |
### AST
| Benchmark | Task | BLEU ↑ | Result |
| ------------------------------ |-------------------------------|:-------------:|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 31.55 | [Link][Link5] |
| [Fleurs][Fleurs] | AST (0-shot, English-Korean) | 11.05 | [Link][Link6] |
#### (Experimental) ASR: [Korean Branch][Korean Branch]
Scores are lower because a Korean text normalizer is not applied.
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
| [Zeroth][Zeroth] | ASR (Korean) | **94.91** | **1.31** | **2.50** | [Link][Link7] |
| [Fleurs][Fleurs] | ASR (Korean) | **62.83** | **9.08** | **23.0** | [Link][Link8] |
| [Covost2][Covost2] | ASR (Korean) | **43.66** | **22.5** | **41.4** | [Link][Link9] |
## Model Details
[junnei]: https://huggingface.co/junnei
- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
## Training Details
- The model was trained by adding a **596B parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a rough sketch of attaching such an adapter with `peft` is shown after this list).
- Due to limited computational resources, the model was **trained only on limited datasets for a limited number of epochs**, covering ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU.
- The training data was limited to **English and Korean** audio clips **shorter than 30 seconds.**
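For illustration only, a LoRA adapter can be attached to a Hugging Face model with `peft` roughly as follows. The rank, alpha, and target module names are assumptions, not the configuration actually used for this model's speech adapter.
```python
# Rough illustration of attaching a LoRA adapter with `peft`.
# r, lora_alpha, and target_modules are assumptions, not this model's real config.
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

base = AutoModel.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```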
## Datasets
### ASR / AST
- [Covost2 Dataset][Covost2] / [No Download Version][Covost2-ko]
- [LibriSpeech][LibriSpeech]
- [Fleurs][Fleurs]
- [Zeroth][Zeroth]
## Limitations
Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use.
To improve the model's performance and reliability, the following areas need further development:
- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Due to limited computing resources, the model **primarily handles audio files shorter than 30 seconds**, so accuracy may drop significantly for longer inputs (see the chunking sketch below).
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
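As a workaround (a sketch, not part of the released code), longer recordings can be split into chunks of at most 30 seconds and transcribed chunk by chunk:
```python
# Sketch: naively split a long recording into <= 30 s chunks before transcription.
# Chunk boundaries ignore silence, so words may be cut at the edges.
import soundfile as sf

def split_audio(path, max_seconds=30.0):
    audio, sr = sf.read(path)
    chunk_len = int(max_seconds * sr)
    chunks = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return chunks, sr

# chunks, sr = split_audio("long_recording.wav")
# Pass each chunk to the processor as shown in the Usage section below.
```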
### Usage
Below are some code snippets to help you get started quickly with the model.
First, upgrade your Transformers library; audio input in chat templates is now supported.
```sh
$ pip install -U transformers
```
Then, copy the snippet from the section that is relevant for your use case.
#### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean" for the experimental Korean-ASR branch

model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Strip the prompt tokens and decode only the newly generated text
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
# What is shown in this image?
```
#### Running the model with raw data
```python
# Reuses `model`, `processor`, and `torch` from the previous snippet.
from io import BytesIO
from urllib.request import urlopen
import soundfile

# Fetch the audio clip from a URL
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

# Build the prompt manually, placing the audio token before the instruction
audio_token = "<start_of_audio>"
messages = [
    {"role": "user", "content": audio_token + "Translate this audio into Korean."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
```
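If your audio is a local file rather than a URL, it can be loaded the same way. The 16 kHz resampling below is an assumption about the expected input rate, not something specified by this card.
```python
# Sketch: load a local file instead of fetching from a URL.
# Resampling to 16 kHz is an assumption, not a documented requirement.
import librosa

audio, sr = librosa.load("my_recording.wav", sr=16000)
inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")
```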
### Finetune the model
[Finetune]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/finetune_speech.py
Here is the finetuning script: [Link][Finetune]
**You must change `output_dir` and `upload_dir`, and adapt the script to your datasets.**
```bash
python finetune_speech.py
```
### Citation
```none
@article{gemma3mm_2025,
title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
author={Seongjun Jang},
year={2025}
}
```