---
license: gemma
base_model:
- google/gemma-2-27b-it
pipeline_tag: text-generation
---

FP8-quantized version of google/gemma-2-27b-it, produced with **compute sponsored by Arrow and NVIDIA through the Danish Data Science Community**.

Quantized using this script:

```python
from transformers import AutoTokenizer

from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-2-27b-it"

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Configure simple PTQ: dynamic FP8 quantization of all Linear layers,
# keeping the lm_head in its original precision.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)

# Apply the quantization algorithm (FP8_DYNAMIC needs no calibration data).
oneshot(model=model, recipe=recipe)

# Save the quantized model and tokenizer.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
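
A minimal inference sketch, assuming vLLM (which can serve FP8-dynamic checkpoints produced by llm-compressor). The local path here is the `SAVE_DIR` written by the script above; a Hugging Face repo id would work the same way.

```python
from vllm import LLM, SamplingParams

# Load the FP8-dynamic checkpoint saved by the quantization script above.
# "gemma-2-27b-it-FP8-Dynamic" is the SAVE_DIR from that script (assumption:
# it is on the local path; substitute a hub repo id if loading remotely).
llm = LLM(model="gemma-2-27b-it-FP8-Dynamic")

sampling_params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Write a haiku about quantization."], sampling_params)
print(outputs[0].outputs[0].text)
```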