MISHANM/google-gemma-3-27b-it-fp8

This model is an fp8-quantized version of google/gemma-3-27b-it, prepared for deployment on hardware that supports 8-bit floating-point inference. fp8 quantization roughly halves the memory footprint relative to 16-bit weights and can improve throughput and latency, while aiming to preserve the output quality of the original model. This makes it a practical choice for serving environments that need high throughput and fast response times.
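
As a rough, hypothetical illustration of what fp8 quantization does (this is not the exact recipe used to produce this checkpoint), the sketch below quantizes a weight tensor to the E4M3 fp8 format with a per-tensor scale and dequantizes it again for computation. It assumes a PyTorch build that provides torch.float8_e4m3fn.

import torch

# Hypothetical per-tensor fp8 (E4M3) weight quantization sketch;
# not the exact procedure used to create this checkpoint.
def quantize_fp8(weight: torch.Tensor):
    # Scale so the largest magnitude maps near the E4M3 maximum (448.0)
    scale = weight.abs().max() / 448.0
    q = (weight / scale).to(torch.float8_e4m3fn)
    return q, scale

def dequantize_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Cast back to a higher-precision dtype before using the weight in a matmul
    return q.to(torch.bfloat16) * scale

w = torch.randn(4096, 4096)
q, scale = quantize_fp8(w)
w_hat = dequantize_fp8(q, scale)
print("max abs reconstruction error:", (w - w_hat.float()).abs().max().item())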

Model Details

  1. Tasks: Causal Language Modeling, Text Generation
  2. Base Model: google/gemma-3-27b-it
  3. Quantization Format: fp8

Device Used

  1. GPUs: 1 x AMD Instinct™ MI210 Accelerator (see the availability check below)
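
The following is a minimal, optional sketch for confirming that PyTorch can see an accelerator before loading the model. On ROCm builds of PyTorch, AMD GPUs such as the MI210 are exposed through the standard torch.cuda API, so the same check works for both AMD and NVIDIA devices.

import torch

# On ROCm builds of PyTorch, AMD accelerators appear under the torch.cuda namespace.
if torch.cuda.is_available():
    print(f"Accelerator detected: {torch.cuda.get_device_name(0)}")
    print(f"Visible device count: {torch.cuda.device_count()}")
else:
    print("No GPU accelerator detected; inference would fall back to CPU.")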

Inference with Hugging Face Transformers


from transformers import AutoProcessor, Gemma3ForConditionalGeneration
import torch

model_id = "MISHANM/google-gemma-3-27b-it-fp8"

# Load the pre-quantized model; device_map="auto" places it on the available accelerator(s)
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"
).eval()

processor = AutoProcessor.from_pretrained(model_id)

# Define chat messages for inference (system prompt plus an image + text user turn)
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs for the model
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

input_len = inputs["input_ids"].shape[-1]

# Generate model output
with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    generation = generation[0][input_len:]

# Decode only the newly generated tokens
decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
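
The same model and processor objects loaded above can also be used for text-only generation; the sketch below simply omits the image entry (the prompt text is illustrative).

# Text-only generation reusing the model and processor from the example above
text_messages = [
    {
        "role": "user",
        "content": [{"type": "text", "text": "Explain fp8 quantization in two sentences."}]
    }
]

text_inputs = processor.apply_chat_template(
    text_messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    output = model.generate(**text_inputs, max_new_tokens=64, do_sample=False)
    output = output[0][text_inputs["input_ids"].shape[-1]:]

print(processor.decode(output, skip_special_tokens=True))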


Citation Information

@misc{MISHANM/google-gemma-3-27b-it-fp8,
  author    = {Mishan Maurya},
  title     = {Introducing fp8 quantized version of google/gemma-3-27b-it},
  year      = {2025},
  publisher = {Hugging Face},
  journal   = {Hugging Face repository}
}