---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
**Terms of Use**: [Terms][terms]
[terms]: https://ai.google.dev/gemma/terms
## Model Summary
**Gemma-3-MM** is an open multimodal instruction model that extends the
capabilities of the original Gemma-3 models to **include speech processing.**
These models leverage the language and vision research used in the
original Gemma-3 models and incorporate **additional speech processing
capabilities** through a Speech Adapter.
The models can process text, image, and audio inputs and generate text outputs, and they come with a 128K-token context length (32K for the 1B model).
~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
with Gemma-3-4b-it-speech being a specific instance of this family.
These models maintain the original Gemma-3 capabilities while adding
multilingual speech recognition and translation abilities.~~
## Evaluation
Model evaluation metrics and results.
Here is the [script][Script] for evaluating the model.
[Script]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/examples/evaluate_speech.py
[Covost2]: https://huggingface.co/datasets/junnei/covost2
[LibriSpeech]: https://huggingface.co/datasets/fixie-ai/librispeech_asr
[Fleurs]: https://huggingface.co/datasets/google/fleurs
[Link1]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_en_us_to_ko_kr.json
[Link2]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_Fleurs_en_us_to_ko_kr.json
[Link3]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_clean_en_us_to_ko_kr.json
[Link4]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/asr_LibriSpeech_other_en_us_to_ko_kr.json
[Link5]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_en_us_to_ko_kr.json
[Link6]: https://huggingface.co/junnei/gemma-3-4b-it-speech/blob/main/eval_result/translation_Fleurs_en_us_to_ko_kr.json
#### ASR
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
| -------------------------------- |----------------|:-------------:|:------------:|:------------:|:-------------:|
| [Covost2][Covost2] | ASR (English) | **86.09** | **4.12** | **7.83** | [Link][Link1] |
| [Fleurs][Fleurs] | ASR (English) | **89.61** | **2.28** | **5.23** | [Link][Link2] |
| [LibriSpeech-Clean][LibriSpeech] | ASR (English) | **94.28** | **0.98** | **2.91** | [Link][Link3] |
| [LibriSpeech-Other][LibriSpeech] | ASR (English) | **87.60** | **3.10** | **6.55** | [Link][Link4] |
#### AST
| Benchmark | Task | BLEU ↑ | Result |
| ------------------------------ |-------------------------------|:-------------:|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 31.55 | [Link][Link5] |
| [Fleurs][Fleurs] | AST (0-shot, English-Korean) | 11.05 | [Link][Link6] |
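For reference, scores like those in the tables above can be computed with standard metric libraries. The snippet below is a minimal sketch of WER, CER, and BLEU computation, assuming the `jiwer` and `sacrebleu` packages; the linked evaluation script may compute them differently.
```python
# Minimal sketch of how WER / CER / BLEU scores like those above can be computed.
# Assumes the `jiwer` and `sacrebleu` packages; the linked evaluation script may differ.
import jiwer
import sacrebleu

references = ["the quick brown fox jumps over the lazy dog"]
hypotheses = ["the quick brown fox jumped over the lazy dog"]

wer = jiwer.wer(references, hypotheses) * 100   # word error rate, lower is better
cer = jiwer.cer(references, hypotheses) * 100   # character error rate, lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # BLEU, higher is better

print(f"WER: {wer:.2f}  CER: {cer:.2f}  BLEU: {bleu:.2f}")
```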
## Model Details
[junnei]: https://huggingface.co/junnei
- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
## Training Details
- The model was trained by adding a **596M-parameter Speech LoRA adapter** to the base Gemma-3-4b-it model.
- Due to limited computational resources, the model was **trained for only 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU for 12 hours.
- The training data was limited to **English and Korean** samples from the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), restricted to clips **shorter than 30 seconds in duration** (a filtering sketch follows below).
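The duration cutoff can be applied with a simple filter over the audio column. The snippet below is a minimal sketch of that preprocessing step; the config name and column layout are assumptions based on common Hugging Face audio datasets, not taken from the actual training code.
```python
# Minimal sketch of filtering training samples to clips shorter than 30 seconds.
# The config name ("en_ko") and the "audio" column layout are assumptions.
from datasets import load_dataset

ds = load_dataset("junnei/covost2", "en_ko", split="train")  # config name is illustrative

def shorter_than_30s(example):
    audio = example["audio"]
    duration = len(audio["array"]) / audio["sampling_rate"]
    return duration < 30.0

ds = ds.filter(shorter_than_30s)
print(f"{len(ds)} clips shorter than 30 seconds remain")
```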
## Limitations
Note that this model is **just a Proof of Concept (PoC) for experimental purposes** and is not intended for production use.
To improve the model's performance and reliability, the following areas need further development:
- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Due to the lack of computing resources,
this model **primarily recognizes audio files shorter than 30 seconds** in duration.
As a result, accuracy may drop significantly for longer audio inputs (one possible workaround is sketched below).
- If possible, we will train the model on Speech-Vision tasks and more Audio-Language tasks.
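For recordings longer than 30 seconds, one possible workaround is to split the audio into sub-30-second chunks and transcribe each chunk separately. The sketch below illustrates this with `soundfile`, assuming `model` and `processor` are loaded as in the Usage section below; it is an illustration, not part of the released code.
```python
# Sketch: transcribe audio longer than 30 seconds by splitting it into <30 s chunks
# and concatenating the per-chunk transcripts. Illustrative only; assumes `model`
# and `processor` are loaded as shown in the Usage section below.
import soundfile
import torch

audio, sr = soundfile.read("long_recording.wav")  # hypothetical local file
chunk_samples = int(29 * sr)  # stay under the 30-second limit

transcripts = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    prompt = processor.tokenizer.apply_chat_template(
        [{"role": "user", "content": "<start_of_audio>Transcribe this audio clip into text."}],
        tokenize=False, add_generation_prompt=True,
    )
    inputs = processor(text=prompt, audio=[chunk], add_special_tokens=False, return_tensors="pt")
    with torch.inference_mode():
        out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    out = out[:, inputs["input_ids"].shape[1]:]
    transcripts.append(processor.batch_decode(out, skip_special_tokens=True)[0])

print(" ".join(transcripts))
```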
### Usage
Below are some code snippets to help you get started running the model quickly.
First, install the Transformers library from source; audio input for `chat_template` is currently only supported on the main branch.
```sh
$ pip install git+https://github.com/huggingface/transformers
```
Then, copy the snippet from the section that is relevant for your use case.
#### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch
model_id = "junnei/gemma-3-4b-it-speech"
revision = "main" # or "v1.0". cause model is being updated, main branch could be crashed.
model = AutoModel.from_pretrained(
model_id, device_map="auto", revision = revision, trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(
model_id, revision = revision, trust_remote_code=True
)
messages = [
{
"role": "user",
"content": [
{"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
{"type": "text", "text": "Transcribe this audio clip into text."}
]
}
]
inputs = processor.apply_chat_template(
messages, add_generation_prompt=True, tokenize=True,
return_dict=True, return_tensors="pt"
)
with torch.inference_mode():
generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1] :]
response = processor.batch_decode(
generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# What is shown in this image?
```
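Since the model also accepts images, the same chat-template path can combine an image and an audio clip in one user turn, so the spoken question in the clip above ("What is shown in this image?") can be answered against an image. The snippet below is a sketch that reuses `model` and `processor` from above; the mixed-content message format mirrors the audio example and is an assumption rather than a documented guarantee, and the image URL is just an example image.
```python
# Sketch: combining an image and an audio clip in a single user turn.
# Reuses `model` and `processor` from the snippet above; the message structure
# mirrors the audio example and is an assumption, not a documented format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]

print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```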
#### Running the model with raw data
```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch

# Reuses `model` and `processor` loaded in the previous snippet.

# Get audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]

response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
### Citation
```none
@article{gemma3mm_2025,
title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
author={Seongjun Jang},
year={2025}
}
```