---
license: gemma
library_name: transformers
base_model: google/gemma-3-4b-it
datasets:
- junnei/covost2
metrics:
- bleu
- cer
- wer
pipeline_tag: automatic-speech-recognition
---
# Gemma 3 MM model card
**Terms of Use**: [Terms][terms]
[terms]: https://ai.google.dev/gemma/terms
## Model Summary
**Gemma-3-MM** is an open multimodal instruction-tuned model that extends the
capabilities of the original Gemma-3 models to **include speech processing.**
It leverages the language and vision research used in the
original Gemma-3 models and incorporates **additional speech processing
capabilities** through a Speech Adapter.
The model can process text, image, and audio inputs, generating text outputs, and comes with a 128K-token context length (32K for the 1B model).
~~The Gemma-3-MM family includes models of various sizes (1B, 4B, 12B, and 27B parameters),
with Gemma-3-4b-it-speech being a specific instance of this family.
These models maintain the original Gemma-3 capabilities while adding
multilingual speech recognition and translation abilities.~~
## Evaluation
Model evaluation metrics and results.
[Covost2]: https://huggingface.co/datasets/junnei/covost2
#### ASR
| Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ |
| ------------------------------ |----------------|:-------------:|:------------:|:------------:|
| [Covost2][Covost2] | ASR (English) | 85.95 | 4.47 | 8.49 |
#### AST
| Benchmark | Task | BLEU ↑ |
| ------------------------------ |-------------------------------|:-------------:|
| [Covost2][Covost2] | AST (0-shot, English-Korean) | 29.83 |
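The exact evaluation scripts are not included in this repository. As a rough guide, the metrics above can be computed with standard libraries such as `jiwer` (WER/CER) and `sacrebleu` (BLEU); the following is only a minimal sketch, assuming you already have lists of reference and hypothesis strings.
```python
# Illustrative metric computation; not the exact script used for the tables above.
import jiwer
import sacrebleu

references = ["this is a test"]    # ground-truth transcripts / translations
hypotheses = ["this is the test"]  # model outputs

wer = jiwer.wer(references, hypotheses) * 100  # word error rate, in percent
cer = jiwer.cer(references, hypotheses) * 100  # character error rate, in percent
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score

print(f"WER: {wer:.2f}  CER: {cer:.2f}  BLEU: {bleu:.2f}")
```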
## Model Details
[junnei]: https://huggingface.co/junnei
- **Developed by:** [junnei][junnei]
- **Model type:** Multimodal (Text, Vision, Speech) Language Model
- **Language(s):** Multilingual
- **License:** [Gemma](https://ai.google.dev/gemma/terms)
- **Base model:** [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)
- **Inspiration:** [Phi-4-multimodal-instruct](https://huggingface.co/microsoft/phi-4-multimodal-instruct)
## Training Details
- The model was trained by adding a **596M-parameter Speech LoRA adapter** to the base Gemma-3-4b-it model (a minimal, hypothetical sketch of this setup follows this list).
- Due to limited computational resources, the model was **only trained for 1 epoch** on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, on a single A100 GPU for 12 hours.
- The training data was limited to the **English and Korean** portions of the [Covost2 Dataset](https://huggingface.co/datasets/junnei/covost2), using only clips **shorter than 30 seconds**.
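The training code is not published in this card, so the snippet below is only a hypothetical sketch of how a LoRA adapter might be attached with PEFT. The target module names, rank, and other hyperparameters are illustrative assumptions, not the actual configuration.
```python
# Hypothetical sketch of attaching a LoRA adapter with PEFT; the real training
# setup (target modules, rank, data pipeline) is not documented in this card.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

base = AutoModel.from_pretrained("google/gemma-3-4b-it")

lora_config = LoraConfig(
    r=16,             # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```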
## Limitations
Note that this model is **a Proof of Concept (PoC) built for experimental purposes** and is not intended for production use.
To improve the model's performance and reliability, the following areas need further development:
- More computational resources are needed for extended training.
- For now, the model only supports Vision-Language tasks and **Audio-Language tasks (ASR/AST).**
- Because of the limited training compute, the model **primarily handles audio clips shorter than 30 seconds**; accuracy may drop significantly for longer inputs. A simple chunking workaround is sketched after this list.
- If possible, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.
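Until longer audio is supported, one pragmatic workaround is to split a long recording into chunks of at most 30 seconds and transcribe each chunk separately. A rough sketch, assuming `model` and `processor` are loaded as in the Usage section below, and that simply concatenating the chunk transcripts is acceptable:
```python
# Naive chunked transcription for audio longer than ~30 seconds.
# Chunk boundaries may cut words mid-utterance; this is a workaround, not a fix.
import torch


def transcribe_long_audio(audio, sr, model, processor, chunk_seconds=30):
    prompt = processor.tokenizer.apply_chat_template(
        [{"role": "user", "content": "<start_of_audio>Transcribe this audio clip into text."}],
        tokenize=False, add_generation_prompt=True,
    )

    chunk_len = chunk_seconds * sr
    pieces = []
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        inputs = processor(
            text=prompt, audio=[chunk], add_special_tokens=False, return_tensors="pt"
        ).to(model.device)
        with torch.inference_mode():
            ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
        ids = ids[:, inputs["input_ids"].shape[1]:]
        pieces.append(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
    return " ".join(pieces)
```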
### Usage
Below are some code snippets to help you get started with running the model.
First, install the Transformers library from source; audio input in `chat_template` is currently only supported on the `main` branch.
```sh
$ pip install git+https://github.com/huggingface/transformers
```
Then, copy the snippet from the section that is relevant for your use case.
#### Running the model with chat_template
```python
from transformers import AutoProcessor, AutoModel
import torch
model_id = "junnei/gemma-3-4b-it-speech"

# trust_remote_code is required to load the custom speech adapter code
model = AutoModel.from_pretrained(
    model_id, device_map="auto", trust_remote_code=True
).eval()
processor = AutoProcessor.from_pretrained(
    model_id, trust_remote_code=True
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Keep only the newly generated tokens (drop the prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
# Expected output (transcription of the clip): "What is shown in this image?"
```
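The same chat-template path can also be used for the AST task by changing only the text instruction, for example reusing the translation prompt from the raw-data snippet below:
```python
# Same pipeline as above, with a translation instruction instead of transcription
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Translate this audio into Korean."}
        ]
    }
]
```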
#### Running the model with raw data
```python
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch
from transformers import AutoModel, AutoProcessor

# Load the model and processor (same as in the previous snippet)
model_id = "junnei/gemma-3-4b-it-speech"
model = AutoModel.from_pretrained(model_id, device_map="auto", trust_remote_code=True).eval()
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Fetch the audio clip and decode it into a waveform and its sample rate
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))
# The audio placeholder token marks where the audio features are inserted in the prompt
audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt"
).to(model.device)
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Keep only the newly generated tokens (drop the prompt)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(response)
```
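Covost2 clips come from Common Voice and may be stored at other sample rates (e.g. 48 kHz), while speech encoders commonly expect 16 kHz mono input. If your waveform does not match what the processor expects, resample it before calling the processor. A sketch using `librosa`, where the 16 kHz target is an assumption rather than a documented requirement of this model:
```python
# Optional preprocessing: the 16 kHz mono target below is an assumption; verify it
# against the model's feature extractor configuration.
import librosa

target_sr = 16000
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix stereo to mono
if sr != target_sr:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    sr = target_sr
```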
### Citation
```bibtex
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}
```