---
base_model:
- meta-llama/Llama-3.1-70B-Instruct
---

The [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model has been quantized using [AutoRound](https://github.com/intel/auto-round) and serialized in the GPTQ format at 4-bit precision. This process achieved an impressive **70% reduction in model size** while retaining **99% of the original accuracy**, ensuring both efficiency and precision for real-world applications.

### How to run

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# device_map="auto" spreads the quantized layers across the available devices
model = AutoModelForCausalLM.from_pretrained(quantized_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

output = tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0], skip_special_tokens=True)
print(output)
```

This quantization process was conducted by [Sofya](https://www.sofya.ai/) to make large-scale language models more accessible.
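### How it was quantized

The exact recipe used to produce this checkpoint is not published here, so the snippet below is only a minimal sketch of a typical AutoRound run with commonly used 4-bit settings (`bits=4`, `group_size=128`, symmetric quantization), exported to the GPTQ format; the output directory name is illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Assumed settings: 4-bit weights, group size 128, symmetric quantization.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in GPTQ format so the checkpoint loads with transformers as shown above.
autoround.save_quantized("./Llama-3.1-70B-Instruct-int4-gptq", format="auto_gptq", inplace=True)
```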