|
The [meta-llama/Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) model has been quantized to 4-bit precision using [AutoRound](https://github.com/intel/auto-round) and serialized in the GPTQ format, resulting in a 70% reduction in size while maintaining 99% of its original accuracy.
|
|
|
This quantization process was conducted by [Sofya](https://www.sofya.ai/). |
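
For reference, an export like this can be produced with AutoRound along the following lines. This is a minimal sketch rather than the exact recipe used for this checkpoint: the calibration settings are not published, and `bits=4`, `group_size=128`, `sym=True` are common AutoRound settings assumed here for illustration.

```python
# Sketch of the quantization step (assumed settings, see note above)
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "meta-llama/Llama-3.1-70B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Calibrate and quantize the weights to 4 bits
autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize()

# Export in the GPTQ serialization format
autoround.save_quantized("Llama-3.1-70B-Instruct-int4", format="auto_gptq")
```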
|
|
|
### How to run |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model = "sofya-ai/Meta-Llama-3.1-70B-Instruct-int4-auto-gptq"

# device_map="auto" shards the 4-bit weights across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    quantized_model,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model)

# Simple completion example
text = "The patient was admitted to the hospital"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```
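
Loading a GPTQ checkpoint through `transformers` typically requires `optimum` together with a GPTQ kernel backend such as `auto-gptq` or `gptqmodel` (which backend is picked up depends on your `transformers` version). Note that even at 4-bit, the 70B weights occupy roughly 35 GB, so a large-memory GPU or several GPUs are needed; `device_map="auto"` handles the placement.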