---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
**Terms of Use**: [Terms][terms]
[terms]: https://ai.google.dev/gemma/terms
## Model Summary
**Gemma-3-MM** is an open multimodal instruction-tuned model that extends the
capabilities of the original Gemma-3 models to **include speech processing.**
It leverages the language and vision research used in the
original Gemma-3 models and incorporates **additional speech processing
capabilities** through a Speech Adapter.
The model can process text, image, and audio inputs, generating text outputs, and comes with a 128K-token context length (32K for the 1B model).
~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
with Gemma-3-4b-it-speech being a specific instance of this family.
These models maintain the original Gemma-3 capabilities while adding
multilingual speech recognition and translation abilities.~~
## Evaluation
Model evaluation metrics and results.
[Covost2]: https://huggingface.co/datasets/junnei/covost2
#### ASR
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ |
| ------------------------------ |----------------|:-------------:|:------------:|:------------:|
| [Covost2][Covost2] | ASR (English) | 85.95 | 4.47 | 8.49 |
#### AST
| Benchmark | Task | BLEU ↑ |
| ------------------------------ |-------------------------------|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 29.83 |
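The exact evaluation scripts are not included in this repository. As a rough guide, the metrics above can be computed with standard libraries such as `jiwer` (WER/CER) and `sacrebleu` (BLEU); the following is only a minimal sketch, assuming you already have lists of reference and hypothesis strings.
```python
# Illustrative metric computation; not the exact script used for the tables above.
import jiwer
import sacrebleu

references = ["this is a test"]    # ground-truth transcripts / translations
hypotheses = ["this is the test"]  # model outputs

wer = jiwer.wer(references, hypotheses) * 100  # word error rate, in percent
cer = jiwer.cer(references, hypotheses) * 100  # character error rate, in percent
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

print(f"WER: {wer:.2f}  CER: {cer:.2f}  BLEU: {bleu:.2f}")
```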
## Model Details
[junnei]: https://huggingface.co/junnei
- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
## Training Details
- The model was trained by adding a **596M-parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a minimal, hypothetical sketch of this setup follows this list).
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU for 12 hours.
- The training data was limited to the **English and Korean** portions of the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), using only clips **shorter than 30 seconds**.
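The training code is not published in this card, so the snippet below is only a hypothetical sketch of how a LoRA adapter might be attached with PEFT. The target module names, rank, and other hyperparameters are illustrative assumptions, not the actual configuration.
```python
# Hypothetical sketch of attaching a LoRA adapter with PEFT; the real training
# setup (target modules, rank, data pipeline) is not documented in this card.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=16,             # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```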
## Limitations
Note that this model is **a Proof of Concept (PoC) built for experimental purposes** and is not intended for production use.
To improve the model's performance and reliability, the following areas need further development:
- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Because of the limited training compute, the model **primarily handles audio clips shorter than 30 seconds**; accuracy may drop significantly for longer inputs. A simple chunking workaround is sketched after this list.
- If possible, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.
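Until longer audio is supported, one pragmatic workaround is to split a long recording into chunks of at most 30 seconds and transcribe each chunk separately. A rough sketch, assuming `model` and `processor` are loaded as in the Usage section below, and that simply concatenating the chunk transcripts is acceptable:
```python
# Naive chunked transcription for audio longer than ~30 seconds.
# Chunk boundaries may cut words mid-utterance; this is a workaround, not a fix.
import torch


def transcribe_long_audio(audio, sr, model, processor, chunk_seconds=30):
    prompt = processor.tokenizer.apply_chat_template(
        [{"role": "user", "content": "<start_of_audio>Transcribe this audio clip into text."}],
        tokenize=False, add_generation_prompt=True,
    )

    chunk_len = chunk_seconds * sr
    pieces = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        inputs = processor(
            text=prompt, audio=[chunk], add_special_tokens=False, return_tensors="pt"
        ).to(model.device)
        with torch.inference_mode():
            ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        ids = ids[:, inputs["input_ids"].shape[1]:]
        pieces.append(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
    return " ".join(pieces)
```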
### Usage
Below are some code snippets to help you get started with running the model.
First, install the Transformers library from source; audio input in `chat_template` is currently only supported on the `main` branch.
```sh
$ pip install git+https://github.com/huggingface/transformers
```
Then, copy the snippet from the section that is relevant for your use case.
#### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch
model_id = "junnei/gemma-3-4b-it-speech"

# trust_remote_code is required to load the custom speech adapter code
model = AutoModel.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Keep only the newly generated tokens (drop the prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# Expected output (transcription of the clip): "What is shown in this image?"
```
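The same chat-template path can also be used for the AST task by changing only the text instruction, for example reusing the translation prompt from the raw-data snippet below:
```python
# Same pipeline as above, with a translation instruction instead of transcription
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Translate this audio into Korean."}
        ]
    }
]
```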
#### Running the model with raw data
```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch
from transformers import AutoModel, AutoProcessor

# Load the model and processor (same as in the previous snippet)
model_id = "junnei/gemma-3-4b-it-speech"
model = AutoModel.from_pretrained(model_id, device_map="auto", trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Fetch the audio clip and decode it into a waveform and its sample rate
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
# The audio placeholder token marks where the audio features are inserted in the prompt
audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Keep only the newly generated tokens (drop the prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
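Covost2 clips come from Common Voice and may be stored at other sample rates (e.g. 48 kHz), while speech encoders commonly expect 16 kHz mono input. If your waveform does not match what the processor expects, resample it before calling the processor. A sketch using `librosa`, where the 16 kHz target is an assumption rather than a documented requirement of this model:
```python
# Optional preprocessing: the 16 kHz mono target below is an assumption; verify it
# against the model's feature extractor configuration.
import librosa

target_sr = 16000
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono
if sr != target_sr:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sr = target_sr
```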
### Citation
```bibtex
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```