brunosdorneles committed on
Commit
d3ff123
1 Parent(s): 18ab45d

Create README.md

Files changed (1)
  1. README.md +18 -0
README.md ADDED
The [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model has been quantized using [AutoRound](https://github.com/intel/auto-round) and serialized in the GPTQ format at 4-bit precision, resulting in a roughly 70% reduction in size while maintaining about 99% of the original model's accuracy.

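The 70% figure is consistent with a back-of-envelope estimate: bf16 weights cost 2 bytes per parameter, while 4-bit weights cost 0.5 bytes plus a small per-group overhead for scales and zero points. A rough sketch (the group size of 128 and the fp16 scale/zero-point overhead are assumptions for illustration, not values read from this checkpoint):

```python
# Rough memory estimate: 4-bit quantized weights vs. bf16 weights.
# Assumptions (not taken from the actual checkpoint): group_size=128,
# one fp16 scale and one fp16 zero point per group.
PARAMS = 70e9                 # ~70B parameters
BF16_BYTES = 2.0              # bytes per parameter in bf16
GROUP_SIZE = 128              # assumed quantization group size
OVERHEAD = 4.0 / GROUP_SIZE   # fp16 scale + zero point per group, per parameter

int4_bytes = 0.5 + OVERHEAD   # bytes per parameter after 4-bit quantization
reduction = 1 - int4_bytes / BF16_BYTES

print(f"bf16: {PARAMS * BF16_BYTES / 1e9:.0f} GB")   # → bf16: 140 GB
print(f"int4: {PARAMS * int4_bytes / 1e9:.0f} GB")   # → int4: 37 GB
print(f"reduction: {reduction:.0%}")                 # → reduction: 73%
```

The remaining gap to the reported 70% is plausibly taken up by embeddings and other tensors that are typically kept in higher precision.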
This quantization process was conducted by [Sofya](https://www.sofya.ai/).

### How to run

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# Load the quantized checkpoint, spreading layers across available devices
model = AutoModelForCausalLM.from_pretrained(quantized_model, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
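The snippet above feeds plain text to an Instruct-tuned model; for chat-style use, prompts should follow Llama 3.1's chat template, which in practice is applied with `tokenizer.apply_chat_template(messages, add_generation_prompt=True)`. As a sketch of what that template produces, here is a hand-rolled prompt builder (the special token strings are the documented Llama 3 chat format; the function name and example messages are illustrative):

```python
# Minimal illustration of the Llama 3.1 chat prompt layout.
# In real use, prefer tokenizer.apply_chat_template(messages, add_generation_prompt=True).
def build_llama3_prompt(messages):
    prompt = "<|begin_of_text|>"
    for msg in messages:
        prompt += (
            f"<|start_header_id|>{msg['role']}<|end_header_id|>\n\n"
            f"{msg['content']}<|eot_id|>"
        )
    # Trailing assistant header cues the model to generate the reply
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt

messages = [
    {"role": "system", "content": "You are a clinical documentation assistant."},
    {"role": "user", "content": "Summarize: the patient was admitted to the hospital."},
]
print(build_llama3_prompt(messages))
```

Passing the resulting string through the tokenizer as in the example above (or tokenizing the template output directly) gives the model the conversational framing it was fine-tuned on.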