Gemma 3 MM model card
Terms of Use: Terms
Model Summary
Gemma-3-MM is a family of open multimodal instruction-tuned models that extend the capabilities of the original Gemma-3 models to include speech processing.
These models leverage the language and vision research behind the original Gemma-3 models and add speech processing capabilities through a Speech Adapter.
The models accept text, image, and audio inputs and generate text outputs, and come with a 128K-token context length (32K for the 1B model).
Evaluation
Model evaluation metrics and results.
Here is the script used to evaluate the model. A minimal metric-computation sketch follows the result tables below.
ASR
Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
---|---|---|---|---|---|
Covost2 | ASR (English) | 86.09 | 4.12 | 7.83 | Link |
Fleurs | ASR (English) | 89.61 | 2.28 | 5.23 | Link |
LibriSpeech-Clean | ASR (English) | 94.28 | 0.98 | 2.91 | Link |
LibriSpeech-Other | ASR (English) | 87.60 | 3.10 | 6.55 | Link |
AST
Benchmark | Task | BLEU ↑ | Result |
---|---|---|---|
Covost2 | AST (0-shot, English-Korean) | 31.55 | Link |
Fleurs | AST (0-shot, English-Korean) | 11.05 | Link |
(Experimental) ASR: Korean Branch
Scores are lower because a Korean text normalizer is not applied.
Benchmark | Task | BLEU ↑ | CER ↓ | WER ↓ | Result |
---|---|---|---|---|---|
Zeroth | ASR (Korean) | 94.91 | 1.31 | 2.50 | Link |
Fleurs | ASR (Korean) | 62.83 | 9.08 | 23.0 | Link |
Covost2 | ASR (Korean) | 43.66 | 22.5 | 41.4 | Link |
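As a rough illustration of how the WER, CER, and BLEU numbers above can be computed from (reference, hypothesis) pairs, here is a minimal sketch using the jiwer and sacrebleu packages. The example strings are placeholders and no text normalization is applied; the evaluation script linked above is the authoritative reference.

import jiwer
import sacrebleu

# Hypothetical (reference, hypothesis) pairs; in practice these come from
# the benchmark transcripts and the model's decoded outputs.
references = ["what is shown in this image", "the quick brown fox"]
hypotheses = ["what is shown in this image", "the quick brown fox jumps"]

wer = jiwer.wer(references, hypotheses) * 100   # word error rate, lower is better
cer = jiwer.cer(references, hypotheses) * 100   # character error rate, lower is better
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # higher is better

print(f"WER: {wer:.2f}  CER: {cer:.2f}  BLEU: {bleu:.2f}")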
Model Details
Developed by: junnei
Model type: Multimodal (Text, Vision, Speech) Language Model
Language(s): Multilingual
License: Gemma
Base model: google/gemma-3-4b-it
Inspiration: Phi-4-multimodal-instruct
Training Details
The model was trained by adding a 596B-parameter Speech LoRA adapter to the base Gemma-3-4b-it model.
Due to limited computational resources, the model was trained only on ASR (Automatic Speech Recognition) and AST (Automatic Speech Translation) tasks, with limited datasets and epochs, on a single A100 GPU.
The training data was limited to English and Korean audio clips of less than 30 seconds in duration.
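For readers unfamiliar with this kind of adapter setup, below is a minimal sketch of attaching a LoRA adapter to the Gemma-3 base model with the peft library. The rank, alpha, and target modules are illustrative assumptions, not the configuration actually used for this checkpoint, and the speech encoder/projector used by Gemma-3-MM is omitted.

from transformers import Gemma3ForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base model (the speech components of Gemma-3-MM are not shown here).
base = Gemma3ForConditionalGeneration.from_pretrained("google/gemma-3-4b-it")

# Illustrative LoRA configuration; rank, alpha, and target modules are assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable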
Datasets
ASR / AST
Limitations
Note that this model is just a Proof of Concept (PoC) for experimental purposes and is not intended for production use. To improve the model's performance and reliability, the following areas need further development:
More computational resources are needed for extended training.
For now, the model only supports Vision-Language tasks and Audio-Language tasks (ASR/AST).
Due to limited computing resources, the model primarily recognizes audio clips shorter than 30 seconds in duration, so accuracy may drop significantly for longer audio inputs (see the duration-check sketch after this list).
If possible, we will train the model on Speech-Vision tasks and additional Audio-Language tasks.
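Because of the 30-second limitation above, it can be worth checking a clip's duration before inference. A minimal sketch with soundfile follows; the file name is a placeholder.

import soundfile

# Quick duration check before inference; the file name is hypothetical.
info = soundfile.info("my_recording.wav")
duration_s = info.frames / info.samplerate

if duration_s > 30:
    print(f"Clip is {duration_s:.1f} s; accuracy may degrade, consider splitting it into < 30 s chunks.")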
Usage
Below are some code snippets to help you get started running the model.
First, upgrade your Transformers library; audio input in chat_template is now supported.
$ pip install -U transformers
Then, copy the snippet from the section that is relevant for your use case.
Running the model with chat_template
from transformers import AutoProcessor, AutoModel
import torch

model_id = "junnei/gemma-3-4b-it-speech"
revision = "main"  # or "korean"

# Load the model and processor (trust_remote_code is required for the speech adapter).
model = AutoModel.from_pretrained(
    model_id, device_map="auto", revision=revision, trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(
    model_id, revision=revision, trust_remote_code=True
)

# Build a chat with an audio clip and a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"},
            {"type": "text", "text": "Transcribe this audio clip into text."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

# Generate and decode only the newly produced tokens.
with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
# What is shown in this image?
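The snippet above covers audio input. Since the base Gemma-3 model also handles images, an image-plus-text chat should look similar. The example below follows the upstream Gemma-3 chat-template conventions and reuses the sample bee image URL from the official Gemma-3 examples; because this checkpoint ships a custom processor (trust_remote_code), treat the image-entry format as an assumption.

# Reuses the `model` and `processor` loaded above.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in one sentence."}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
)

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])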
Running the model with raw data
from io import BytesIO
from urllib.request import urlopen

import soundfile
import torch

# Reuses the `model` and `processor` loaded in the previous snippet.

# Get audio data from a URL.
url = "https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/examples/what_is_shown_in_this_image.wav"
audio, sr = soundfile.read(BytesIO(urlopen(url).read()))

# The audio placeholder token is prepended to the text prompt.
audio_token = '<start_of_audio>'
messages = [
    {'role': 'user', 'content': audio_token + 'Translate this audio into Korean.'},
]

prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(text=prompt, audio=[audio], add_special_tokens=False, return_tensors="pt")

with torch.inference_mode():
    generate_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]

print(response)
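As noted in the Limitations section, the model was trained on clips under 30 seconds. A simple workaround is to split a long recording and process each chunk separately, as in the sketch below, which reuses the model and processor from the snippet above. The file name is hypothetical, the split is a naive fixed-length one, and stitching chunk outputs together can lose context at the boundaries.

import soundfile

# Hypothetical long local recording; `model` and `processor` come from the snippet above.
audio, sr = soundfile.read("long_recording.wav")
chunk_samples = 30 * sr  # naive fixed-length split into < 30 s pieces

audio_token = '<start_of_audio>'
messages = [{'role': 'user', 'content': audio_token + 'Transcribe this audio clip into text.'}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

transcripts = []
for start in range(0, len(audio), chunk_samples):
    chunk = audio[start:start + chunk_samples]
    inputs = processor(text=prompt, audio=[chunk], add_special_tokens=False, return_tensors="pt")
    with torch.inference_mode():
        ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    ids = ids[:, inputs['input_ids'].shape[1]:]
    transcripts.append(processor.batch_decode(ids, skip_special_tokens=True)[0])

print(" ".join(transcripts))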
Finetune the model
Here is the finetuning script: Link
You must change output_dir and upload_dir, and adapt the script to your datasets.
$ python finetune_speech.py
Citation
@article{gemma3mm_2025,
  title={Gemma-3-MM: Multimodal Language Models with Speech Capabilities},
  author={Seongjun Jang},
  year={2025}
}