|
The [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model has been quantized to 4-bit precision using [AutoRound](https://github.com/intel/auto-round) and serialized in the GPTQ format, resulting in a 70% reduction in size while maintaining 99% of its original accuracy.
|
|
|
This quantization process was conducted by [Sofya](https://www.sofya.ai/). |
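
For reference, an export like this can be produced with AutoRound along the following lines. This is a minimal sketch rather than the exact recipe used for this checkpoint: the calibration settings are not published, and `bits=4`, `group_size=128`, `sym=True` are common AutoRound settings assumed here for illustration.

```python
# Sketch of the quantization step (assumed settings, see note above)
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibrate and quantize the weights to 4 bits
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in the GPTQ serialization format
autoround.save_quantized("Llama-3.1-70B-Instruct-int4", format="auto_gptq")
```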
|
|
|
### How to run |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# device_map="auto" shards the 4-bit weights across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    quantized_model,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

# Simple completion example
text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
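
Loading a GPTQ checkpoint through `transformers` typically requires `optimum` together with a GPTQ kernel backend such as `auto-gptq` or `gptqmodel` (which backend is picked up depends on your `transformers` version). Note that even at 4-bit, the 70B weights occupy roughly 35 GB, so a large-memory GPU or several GPUs are needed; `device_map="auto"` handles the placement.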