|
--- |
|
license: gemma |
|
library_name: transformers |
|
base_model: google/gemma-3-4b-it |
|
datasets: |
|
- junnei/covost2 |
|
metrics: |
|
- bleu |
|
- cer |
|
- wer |
|
pipeline_tag: automatic-speech-recognition |
|
--- |
|
|
|
# Gemma 3 MM model card |
|
|
|
|
|
**Terms of Use**: [Terms][terms] |
|
|
|
[terms]: https://ai.google.dev/gemma/terms |
|
|
|
## Model Summary |
|
|
|
**Gemma-3-MM** is an open family of multimodal instruction models that extends the capabilities of the original Gemma-3 models to **include speech processing.**
|
|
|
These models leverage the language and vision research used in the |
|
original Gemma-3 models and incorporate **additional speech processing |
|
capabilities** through a Speech Adapter. |
|
|
|
The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model). |
|
|
|
Gemma-3-4b-it-speech is a 4B-parameter instance of this approach: it maintains the original Gemma-3 capabilities while adding multilingual speech recognition and translation abilities.
|
|
|
## Evaluation |
|
|
|
Model evaluation metrics and results. |
|
|
|
Here is a [script][Script] for evaluating the model.
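For a quick offline check, the reported metrics can be computed with the Hugging Face `evaluate` library. The snippet below is a minimal sketch with placeholder predictions and references; see the linked script for the actual evaluation pipeline:

```python
# Minimal metric sketch using the Hugging Face `evaluate` library.
# `predictions` and `references` are placeholders; the linked script
# shows how they are actually produced from the model.
import evaluate

predictions = ["the quick brown fox"]  # model transcriptions / translations
references = ["the quick brown fox"]   # ground-truth texts

wer = evaluate.load("wer").compute(predictions=predictions, references=references)
cer = evaluate.load("cer").compute(predictions=predictions, references=references)
bleu = evaluate.load("sacrebleu").compute(
    predictions=predictions, references=[[r] for r in references]
)["score"]

print(f"WER: {wer:.4f}  CER: {cer:.4f}  BLEU: {bleu:.2f}")
```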
|
|
|
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py |
|
[Covost2]: https://huggingface.co/datasets/junnei/covost2 |
|
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr |
|
[Fleurs]: https://huggingface.co/datasets/google/fleurs |
|
|
|
[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json |
|
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json |
|
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json |
|
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json |
|
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json |
|
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json |
|
|
|
#### ASR |
|
|
|
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result | |
|
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:| |
|
| [Covost2][Covost2] | ASR (English) | **86.09** | **4.12** | **7.83** | [Link][Link1] | |
|
| [Fleurs][Fleurs] | ASR (English) | **89.61** | **2.28** | **5.23** | [Link][Link2] | |
|
| [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] | |
|
| [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] | |
|
|
|
|
|
#### AST |
|
|
|
| Benchmark | Task | BLEU ↑ | Result | |
|
| ------------------------------ |-------------------------------|:-------------:|:-------------:| |
|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 31.55 | [Link][Link5] | |
|
| [Fleurs][Fleurs] | AST (0-shot, English-Korean) | 11.05 | [Link][Link6] | |
|
|
|
|
|
## Model Details |
|
|
|
[junnei]: https://huggingface.co/junnei |
|
Developed by: [junnei][junnei] |
|
|
|
Model type: Multimodal (Text, Vision, Speech) Language Model |
|
|
|
Language(s): Multilingual |
|
|
|
License: [Gemma](https://ai.google.dev/gemma/terms) |
|
|
|
Base model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it) |
|
|
|
Inspiration: [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct) |
|
|
|
## Training Details |
|
|
|
- The model was trained by adding a **596M-parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a sketch of the approach follows this list).
|
|
|
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, using a single A100 GPU for 12 hours.
|
|
|
- The training data was limited to **English and Korean** audio from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), with clips **less than 30 seconds in duration.**
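For readers unfamiliar with the approach, the sketch below shows how a LoRA adapter is typically attached with the `peft` library. The rank, alpha, and target modules are illustrative guesses, not this model's actual configuration:

```python
# Illustrative sketch of attaching a LoRA adapter with `peft`.
# Rank, alpha, and target modules are assumptions, not the exact
# configuration used to train the Speech LoRA adapter.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForImageTextToText

base = AutoModelForImageTextToText.from_pretrained("google/gemma-3-4b-it")
lora_cfg = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # prints the adapter's trainable parameter count
```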
|
|
|
## Limitations |
|
|
|
Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use. |
|
To improve the model's performance and reliability, the following areas need further development: |
|
|
|
- More computational resources are needed for extended training.
|
|
|
- For now, the model only handles Vision-Language tasks and **Audio-Language tasks (ASR/AST)**; combined speech-vision inputs are not yet supported.
|
|
|
- Due to limited computing resources, the model **primarily recognizes audio clips shorter than 30 seconds**; accuracy may drop significantly for longer inputs (a chunking workaround is sketched after this list).
|
|
|
- If resources allow, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.
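As a workaround for the 30-second limit, longer recordings can be split into windows and transcribed piece by piece. A naive sketch (fixed-offset splits can cut words in half; splitting on silence with a VAD would give cleaner boundaries):

```python
# Naive fixed-window chunking for audio longer than 30 seconds.
# Splitting at fixed offsets can cut words in half; a silence-based
# splitter (e.g. a VAD) would be more robust.
import soundfile as sf

audio, sr = sf.read("long_recording.wav")  # illustrative file name
chunk_samples = 30 * sr                    # 30-second windows

chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
# Each chunk can then be fed to the processor/model as shown in
# the "Running the model with raw data" snippet below.
```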
|
|
|
### Usage |
|
|
|
Below are some code snippets to help you get started running the model quickly.
|
|
|
First, install the Transformers library from source; audio input for `chat_template` is currently only supported on the main branch.
|
|
|
```sh |
|
$ pip install git+https://github.com/huggingface/transformers |
|
``` |
|
|
|
Then, copy the snippet from the section that is relevant for your use case. |
|
|
|
#### Running the model with chat_template |
|
|
|
```python |
|
from transformers import AutoProcessor, AutoModel |
|
import torch |
|
|
|
model_id = "junnei/gemma-3-4b-it-speech" |
|
revision = "main" # or "v1.0". cause model is being updated, main branch could be crashed. |
|
|
|
model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()
|
|
|
processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)
|
|
|
messages = [ |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"}, |
|
{"type": "text", "text": "Transcribe this audio clip into text."} |
|
] |
|
} |
|
] |
|
|
|
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)  # move inputs to the model's device (needed with device_map="auto")
|
|
|
with torch.inference_mode(): |
|
generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False) |
|
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :] |
|
response = processor.batch_decode( |
|
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
)[0] |
|
print(response) |
|
|
|
# Expected output: What is shown in this image?
|
``` |
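Since the base model retains Gemma-3's vision capabilities, the same template also accepts image content. A hedged variant of the message above (the image URL is a placeholder, and the exact image message format should be confirmed against the processor):

```python
# Hypothetical image + text message; the URL is a placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/some_image.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
# The rest of the pipeline (apply_chat_template, generate, batch_decode)
# is identical to the audio example above.
```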
|
|
|
|
|
#### Running the model with raw data |
|
|
|
```python |
|
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch

# `model` and `processor` are reused from the previous snippet.
|
|
|
|
|
# Load the audio data from a URL
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
audio_token = '<start_of_audio>'  # special token marking where the audio goes in the prompt
|
|
|
|
|
messages = [ |
|
{'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'}, |
|
] |
|
|
|
prompt = processor.tokenizer.apply_chat_template( |
|
messages, tokenize=False, add_generation_prompt=True |
|
) |
|
|
|
|
|
inputs = processor(
    text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt"
).to(model.device)  # move inputs to the model's device (needed with device_map="auto")
|
|
|
with torch.inference_mode(): |
|
generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False) |
|
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :] |
|
response = processor.batch_decode( |
|
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False |
|
)[0] |
|
print(response) |
|
``` |
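One practical note: speech encoders commonly expect 16 kHz input. That rate is an assumption here, so check the processor's feature-extractor config; if your file's sample rate differs, resample before calling the processor:

```python
# Resample to 16 kHz before passing audio to the processor.
# The 16 kHz target is an assumption; confirm it against the
# processor's feature-extractor configuration.
import librosa

target_sr = 16000
if sr != target_sr:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sr = target_sr
```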
|
|
|
|
|
### Citation |
|
|
|
```none |
|
@misc{gemma3mm_2025,
|
title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities}, |
|
author={Seongjun Jang}, |
|
year={2025} |
|
} |
|
|
|
``` |