The meta-llama/Llama-3.1-70B-Instruct model has been quantized to 4-bit precision with AutoRound and serialized in the GPTQ format, yielding a 70% reduction in size while maintaining 99% of the original model's accuracy.
The quantization was performed by Sofya.
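For reference, a quantization run of this kind can be reproduced with AutoRound roughly as sketched below. The calibration settings shown (bits, group size, symmetric quantization) and the output directory name are assumptions about a typical 4-bit GPTQ export, not the exact recipe used for this checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

base_model = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# Typical 4-bit settings; the exact calibration recipe for this checkpoint may differ.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export the quantized weights using the GPTQ serialization format.
autoround.save_quantized("Meta-Llama-3.1-70B-Instruct-int4-auto-gptq", format="auto_gptq")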
How to run
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPTQ checkpoints require a GPTQ-capable backend (e.g. auto-gptq or gptqmodel)
# to be installed alongside transformers.
quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# Load the 4-bit model and its tokenizer; device_map="auto" spreads the
# weights across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(quantized_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

# Generate a short continuation of a sample clinical prompt.
text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
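Whether the snippet decodes greedily or samples depends on the checkpoint's generation_config; decoding parameters such as do_sample, temperature, and top_p can be passed explicitly to model.generate to control the output.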