license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
Gemma 3 MM model card
Terms of Use: Terms
Model Summary
Gemma-3-MM is an open multimodal instruction model that extends the capabilities of the original Gemma-3 models to include speech processing.
These models leverage the language and vision research used in the original Gemma-3 models and incorporate additional speech processing capabilities through a Speech Adapter.
The models can process text, image, and audio inputs, generating text outputs, and come with a 128K token context length (32K for the 1B model).
The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
with Gemma-3-4b-it-speech being a specific instance of this family.
These models maintain the original Gemma-3 capabilities while adding
multilingual speech recognition and translation abilities.
Evaluation
Model evaluation metrics and results.
ASR
Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ |
---|---|---|---|---|
Covost2 | ASR (English) | 85.95 | 4.47 | 8.49 |
AST
Benchmark | Task | BLEU ↑ |
---|---|---|
Covost2 | AST (0-shot, English-Korean) | 29.83 |
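For reference, metrics like these are commonly computed with the Hugging Face evaluate library. The sketch below is an illustration with made-up example strings, not the exact evaluation script used for the tables above.
# Minimal sketch of WER/CER/BLEU computation with the evaluate library;
# the predictions and references here are hypothetical examples.
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")
bleu_metric = evaluate.load("sacrebleu")

predictions = ["the quick brown fox jumps over the lazy dog"]
references = ["the quick brown fox jumped over the lazy dog"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
# sacrebleu expects one list of reference strings per prediction.
print("BLEU:", bleu_metric.compute(predictions=predictions, references=[[r] for r in references])["score"])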
Model Details
- Developed by: junnei
- Model type: Multimodal (Text, Vision, Speech) Language Model
- Language(s): Multilingual
- License: Gemma
- Base model: google/gemma-3-4b-it
- Inspiration: Phi-4-multimodal-instruct
Training Details
The model was trained by adding a 596M-parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
Due to limited computational resources, the model was trained for only 1 epoch on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU for 12 hours.
The training data was limited to English and Korean audio clips shorter than 30 seconds from the Covost2 dataset.
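As a rough illustration of what attaching a LoRA adapter to the base model looks like, the sketch below uses the peft library; the rank, alpha, and target modules are generic assumptions for illustration, not the actual configuration of this model's Speech Adapter.
# Minimal sketch, assuming generic LoRA hyperparameters; not the actual
# speech-adapter configuration used to train this model.
from transformers import Gemma3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it", torch_dtype="auto")
lora_config = LoraConfig(
    r=16,                                                     # rank (assumption)
    lora_alpha=32,                                            # scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # generic attention projections (assumption)
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # the adapter adds only a small fraction of trainable parameters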
Limitations
Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:
- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and Audio-Language tasks (ASR/AST).
- Because compute was limited, the model was trained mainly on audio files shorter than 30 seconds, so accuracy may drop significantly for longer audio inputs (a simple chunking workaround is sketched after this list).
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
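A simple workaround for the 30-second limitation is to split long recordings into shorter chunks and transcribe them one at a time. The sketch below shows only the splitting step; the file name is hypothetical, and the 30-second limit mirrors the training constraint described above.
# Minimal sketch: split a long recording into chunks of at most 30 seconds.
# "long_recording.wav" is a hypothetical local file.
import soundfile as sf

def split_audio(path, max_seconds=30):
    audio, sr = sf.read(path)
    chunk_len = int(max_seconds * sr)
    return [audio[start:start + chunk_len] for start in range(0, len(audio), chunk_len)], sr

chunks, sr = split_audio("long_recording.wav")
# Each chunk can then be passed to the processor as shown in the usage examples below.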
Usage
Below are some code snippets to help you get started running the model quickly.
First, install the Transformers library from source, since audio input for chat_template is currently only supported on the main branch.
$ pip install git+https://github.com/huggingface/transformers
Then, copy the snippet from the section that is relevant for your use case.
Running the model with chat_template
from transformers import AutoProcessor, AutoModel
import torch
model_id = "junnei/gemma-3-4b-it-speech"
revision = "v1.0" # cause model is being updated, main branch could be crashed.
model = AutoModel.from_pretrained(
model_id, device_map="auto", revision = revision, trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(
model_id, revision = revision, trust_remote_code=True
)
messages = [
{
"role": "user",
"content": [
{"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
{"type": "text", "text": "Transcribe this audio clip into text."}
]
}
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
)
with torch.inference_mode():
generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# Expected output: What is shown in this image?
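The same chat template also covers speech translation (AST). As a sketch, reusing the model and processor loaded above, only the instruction text changes:
# Variation on the snippet above: speech translation through the chat template.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Translate this audio into Korean."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.batch_decode(generate_ids[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True)[0])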
Running the model with raw data
from io import BytesIO
from urllib.request import urlopen
import soundfile
from PIL import Image
# get Audio data from URL
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
audio_token = '<start_of_audio>'
messages = [
{'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]
prompt = processor.tokenizer.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")
with torch.inference_mode():
generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
Citation
@article{gemma3mm_2025,
title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
author={Seongjun Jang},
year={2025}
}