xmadai/Llama-3.2-3B-Instruct-xMADai-INT4

	---
	base_model:
	- meta-llama/Llama-3.2-3B-Instruct
	library_name: transformers
	license: llama3.2
	---

	# This model has been xMADified!

	This repository contains [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) quantized from 16-bit floats to 4-bit integers using xMAD.ai's proprietary technology.

	# How to Run the Model

	Loading this xMADified checkpoint requires less than 3 GiB of VRAM, so it can run efficiently on most laptop GPUs.
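As a rough sanity check on that figure, the sketch below estimates the memory taken by the weights alone at 16-bit and 4-bit precision. The parameter count is an assumed round number for a "3B" model, not the exact count for this checkpoint, and the function name is illustrative.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / GIB


# Assumption: a round figure for a "3B" model, not the exact parameter count.
NUM_PARAMS = 3.2e9

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.2f} GiB")
print(f"4-bit weights:  ~{int4_gib:.2f} GiB")
```

Under this assumption the weights shrink from about 6 GiB to about 1.5 GiB. A real quantized checkpoint is typically somewhat larger than this estimate, because quantization metadata and any layers kept at higher precision add to the total, which is consistent with the "less than 3 GiB" figure.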

	Package prerequisites: Run the following commands to install the required packages.
	```bash
	pip install -q --upgrade transformers accelerate optimum
	pip install -q --no-build-isolation auto-gptq
	```
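After installing, it can help to confirm that the packages are visible to the Python interpreter you plan to use. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is illustrative, and the package list simply mirrors the `pip` commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above.
REQUIRED = ["transformers", "accelerate", "optimum", "auto-gptq"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(REQUIRED).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` was most likely installed into a different environment than the one running the script.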

	Sample Inference Code

	```python
	from transformers import AutoTokenizer
	from auto_gptq import AutoGPTQForCausalLM

	# Repository ID of this quantized checkpoint on the Hugging Face Hub
	model_id = "xmadai/Llama-3.2-3B-Instruct-xMADai-INT4"
	prompt = [
	    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
	    {"role": "user", "content": "What's Deep Learning?"},
	]

	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# Render the messages with the model's chat template and tokenize them
	inputs = tokenizer.apply_chat_template(
	    prompt,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_tensors="pt",
	    return_dict=True,
	).to("cuda")

	# Load the 4-bit quantized weights
	model = AutoGPTQForCausalLM.from_quantized(
	    model_id,
	    device_map="auto",
	    trust_remote_code=True,
	)

	# Sample up to 256 new tokens, then decode the full sequences back to text
	outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
	print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	```
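For a decoder-only model, the sequences returned by `generate` begin with the prompt tokens, so the decoded text above includes the system and user messages as well as the reply. If only the reply is wanted, one option is to drop the prompt tokens before decoding. The sketch below shows the idea on plain Python lists with made-up token IDs; the helper name `new_tokens_only` is illustrative.

```python
def new_tokens_only(output_ids, prompt_length):
    """Drop the first `prompt_length` tokens from every generated sequence."""
    return [list(seq[prompt_length:]) for seq in output_ids]


# Toy token IDs for illustration only; real IDs come from the tokenizer.
prompt_ids = [101, 7, 8, 9]
generated = [prompt_ids + [42, 43, 44]]

print(new_tokens_only(generated, len(prompt_ids)))  # prints [[42, 43, 44]]
```

With the tensors from the sample code, the equivalent slice would be `outputs[:, inputs["input_ids"].shape[1]:]`, passed to `tokenizer.batch_decode` in place of `outputs`.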

	For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.