Gemma 3 MM model card
Terms of Use: Terms
Model Summary
Gemma-3-MM is a family of open multimodal instruction-tuned models that extend the capabilities of the original Gemma-3 models to include speech processing.
These models leverage the language and vision research behind the original Gemma-3 models and add speech processing capabilities through a Speech Adapter.
The models accept text, image, and audio inputs and generate text outputs, and come with a 128K-token context length (32K for the 1B model).
Evaluation
Model evaluation metrics and results.
Here is the script used to evaluate the model. A minimal metric-computation sketch follows the result tables below.
ASR
Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
---|---|---|---|---|---|
Covost2 | ASR (English) | 86.09 | 4.12 | 7.83 | Link |
Fleurs | ASR (English) | 89.61 | 2.28 | 5.23 | Link |
LibriSpeech-Clean | ASR (English) | 94.28 | 0.98 | 2.91 | Link |
LibriSpeech-Other | ASR (English) | 87.60 | 3.10 | 6.55 | Link |
AST
Benchmark | Task | BLEU ↑ | Result |
---|---|---|---|
Covost2 | AST (0-shot, English-Korean) | 31.55 | Link |
Fleurs | AST (0-shot, English-Korean) | 11.05 | Link |
(Experimental) ASR: Korean Branch
Scores are lower because a Korean text normalizer is not applied.
Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
---|---|---|---|---|---|
Zeroth | ASR (Korean) | 94.91 | 1.31 | 2.50 | Link |
Fleurs | ASR (Korean) | 62.83 | 9.08 | 23.0 | Link |
Covost2 | ASR (Korean) | 43.66 | 22.5 | 41.4 | Link |
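As a rough illustration of how the WER, CER, and BLEU numbers above can be computed from (reference, hypothesis) pairs, here is a minimal sketch using the jiwer and sacrebleu packages. The example strings are placeholders and no text normalization is applied; the evaluation script linked above is the authoritative reference.

import jiwer
import sacrebleu

# Hypothetical (reference, hypothesis) pairs; in practice these come from
# the benchmark transcripts and the model's decoded outputs.
references = ["what is shown in this image", "the quick brown fox"]
hypotheses = ["what is shown in this image", "the quick brown fox jumps"]

wer = jiwer.wer(references, hypotheses) * 100   # word error rate, lower is better
cer = jiwer.cer(references, hypotheses) * 100   # character error rate, lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # higher is better

print(f"WER: {wer:.2f}  CER: {cer:.2f}  BLEU: {bleu:.2f}")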
Model Details
Developed by: junnei
Model type: Multimodal (Text, Vision, Speech) Language Model
Language(s): Multilingual
License: Gemma
Base model: google/gemma-3-4b-it
Inspiration: Phi-4-multimodal-instruct
Training Details
The model was trained by adding a 596B-parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
Due to limited computational resources, the model was trained only on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, with limited datasets and epochs, on a single A100 GPU.
The training data was limited to English and Korean audio clips of less than 30 seconds in duration.
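For readers unfamiliar with this kind of adapter setup, below is a minimal sketch of attaching a LoRA adapter to the Gemma-3 base model with the peft library. The rank, alpha, and target modules are illustrative assumptions, not the configuration actually used for this checkpoint, and the speech encoder/projector used by Gemma-3-MM is omitted.

from transformers import Gemma3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model (the speech components of Gemma-3-MM are not shown here).
base = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it")

# Illustrative LoRA configuration; rank, alpha, and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable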
Datasets
ASR / AST
Limitations
Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:
More computational resources are needed for extended training.
For now, the model only supports Vision-Language tasks and Audio-Language tasks (ASR/AST).
Due to limited computing resources, the model primarily recognizes audio clips shorter than 30 seconds in duration, so accuracy may drop significantly for longer audio inputs (see the duration-check sketch after this list).
If possible, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.
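Because of the 30-second limitation above, it can be worth checking a clip's duration before inference. A minimal sketch with soundfile follows; the file name is a placeholder.

import soundfile

# Quick duration check before inference; the file name is hypothetical.
info = soundfile.info("my_recording.wav")
duration_s = info.frames / info.samplerate

if duration_s > 30:
    print(f"Clip is {duration_s:.1f} s; accuracy may degrade, consider splitting it into < 30 s chunks.")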
Usage
Below are some code snippets to help you get started running the model.
First, upgrade your Transformers library; audio input in chat_template is now supported.
$ pip install -U transformers
Then, copy the snippet from the section that is relevant for your use case.
Running the model with chat_template
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean"

# Load the model and processor (trust_remote_code is required for the speech adapter).
model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

# Build a chat with an audio clip and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

# Generate and decode only the newly produced tokens.
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
# What is shown in this image?
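The snippet above covers audio input. Since the base Gemma-3 model also handles images, an image-plus-text chat should look similar. The example below follows the upstream Gemma-3 chat-template conventions and reuses the sample bee image URL from the official Gemma-3 examples; because this checkpoint ships a custom processor (trust_remote_code), treat the image-entry format as an assumption.

# Reuses the `model` and `processor` loaded above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])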
Running the model with raw data
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch

# Reuses the `model` and `processor` loaded in the previous snippet.

# Get audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

# The audio placeholder token is prepended to the text prompt.
audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
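As noted in the Limitations section, the model was trained on clips under 30 seconds. A simple workaround is to split a long recording and process each chunk separately, as in the sketch below, which reuses the model and processor from the snippet above. The file name is hypothetical, the split is a naive fixed-length one, and stitching chunk outputs together can lose context at the boundaries.

import soundfile

# Hypothetical long local recording; `model` and `processor` come from the snippet above.
audio, sr = soundfile.read("long_recording.wav")
chunk_samples = 30 * sr  # naive fixed-length split into < 30 s pieces

audio_token = '<start_of_audio>'
messages = [{'role': 'user', 'content': audio_token + 'Transcribe this audio clip into text.'}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

transcripts = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    inputs = processor(text=prompt, audio=[chunk], add_special_tokens=False, return_tensors="pt")
    with torch.inference_mode():
        ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    ids = ids[:, inputs['input_ids'].shape[1]:]
    transcripts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])

print(" ".join(transcripts))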
Finetune the model
Here is the finetuning script: Link
You must change output_dir and upload_dir, and adapt the script to your datasets.
$ python finetune_speech.py
Citation
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}