	---
	library_name: transformers
	license: llama3.2
	base_model:
	- meta-llama/Llama-3.2-1B-Instruct
	---

	# This model has been xMADified!

	This repository contains [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) quantized from 16-bit floats to 4-bit integers, using xMAD.ai's proprietary technology.

	# How to Run the Model

	Loading the checkpoint of this xMADified model requires less than 2 GiB of VRAM, so it runs efficiently on most laptop GPUs.
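
As a rough sanity check on the VRAM figure, the sketch below estimates how much memory the weights alone would occupy at 16-bit and 4-bit precision. The parameter count (about 1.23 billion) is an assumption about the base model and is not stated in this README. The result is a lower bound only: it ignores layers that may stay unquantized, quantization metadata, activations, and the KV cache.

```python
# Rough weight-only memory estimate; ignores activations, KV cache,
# and any quantization metadata (scales, zero points).
BYTES_PER_GIB = 1024 ** 3


def weight_memory_gib(num_params: int, bits_per_weight: int) -> float:
    """Return the GiB needed to store `num_params` weights at the given precision."""
    return num_params * bits_per_weight / 8 / BYTES_PER_GIB


# Assumption: roughly 1.23 billion parameters for the 1B base model.
ASSUMED_PARAMS = 1_230_000_000

if __name__ == "__main__":
    for bits in (16, 4):
        size = weight_memory_gib(ASSUMED_PARAMS, bits)
        print(f"{bits}-bit weights: {size:.2f} GiB")
```

The gap between this weight-only figure and the actual footprint is expected, since real checkpoints carry more than the raw quantized weights.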

	Package prerequisites: Run the following commands to install the required packages.
	```bash
	pip install -q --upgrade transformers accelerate optimum
	pip install -q --no-build-isolation auto-gptq
	```
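
To confirm the installation before loading the model, a small check like the following can help. It is a minimal sketch that uses only the Python standard library; the helper name `check_packages` is hypothetical, and the mapping reflects that the `auto-gptq` distribution is imported as `auto_gptq`.

```python
from importlib import metadata, util

# Distribution name -> import name for the packages installed above.
REQUIRED = {
    "transformers": "transformers",
    "accelerate": "accelerate",
    "optimum": "optimum",
    "auto-gptq": "auto_gptq",
}


def check_packages(packages: dict) -> dict:
    """Map each distribution name to its installed version, or None if it is missing."""
    report = {}
    for dist_name, import_name in packages.items():
        if util.find_spec(import_name) is None:
            report[dist_name] = None  # module cannot be imported
            continue
        try:
            report[dist_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            report[dist_name] = "unknown"  # importable, but no distribution metadata
    return report


if __name__ == "__main__":
    for name, version in check_packages(REQUIRED).items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```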

	Sample Inference Code

	```python
	from transformers import AutoTokenizer
	from auto_gptq import AutoGPTQForCausalLM

	model_id = "xmadai/Llama-3.2-1B-Instruct-xMADai-4bit"

	# Chat-style prompt: a system instruction followed by the user's question.
	prompt = [
	    {"role": "system", "content": "You are a helpful assistant, that responds as a pirate."},
	    {"role": "user", "content": "What's Deep Learning?"},
	]

	tokenizer = AutoTokenizer.from_pretrained(model_id)

	# Apply the model's chat template, tokenize, and move the tensors to the GPU.
	inputs = tokenizer.apply_chat_template(
	    prompt,
	    tokenize=True,
	    add_generation_prompt=True,
	    return_tensors="pt",
	    return_dict=True,
	).to("cuda")

	# Load the 4-bit quantized checkpoint.
	model = AutoGPTQForCausalLM.from_quantized(
	    model_id,
	    device_map='auto',
	    trust_remote_code=True,
	)

	# Sample up to 256 new tokens and print the decoded sequences.
	outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
	print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
	```
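
For a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so the sample above prints the prompt text along with the reply. If only the reply is wanted, one option is to drop the prompt tokens before decoding. The helper below is a minimal sketch; `new_token_ids` is a hypothetical name, and it assumes every sequence in the batch starts with a prompt of the same length (true for a single prompt, as in the sample).

```python
def new_token_ids(output_ids, prompt_length):
    """Drop the first `prompt_length` token ids from every generated sequence."""
    return [sequence[prompt_length:] for sequence in output_ids]


# Toy example: plain lists stand in for token ids (made-up values).
toy_outputs = [[101, 102, 103, 7, 8, 9]]
print(new_token_ids(toy_outputs, prompt_length=3))  # prints [[7, 8, 9]]
```

With the variables from the sample above, the call would look like `tokenizer.batch_decode(new_token_ids(outputs, inputs["input_ids"].shape[1]), skip_special_tokens=True)`.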

	Other xMADified models and their GPU memory requirements are listed below.

	Model | GPU Memory Requirement (before → after quantization)
	--- | ---
	Llama-3.2-3B-Instruct-xMADai-4bit | 6.5 GB → 3.5 GB
	Llama-3.2-1B-Instruct-xMADai-4bit | 2.5 GB → 2 GB
	Llama-3.1-405B-Instruct-xMADai-4bit | 258.14 GB → 250 GB
	Llama-3.1-8B-Instruct-xMADai-4bit | 16 GB → 7 GB
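
The relative savings implied by the table can be worked out with a few lines of arithmetic. This sketch copies the figures from the table and assumes each row reads as "requirement before quantization → requirement after"; the helper name `percent_saved` is hypothetical.

```python
# (before_gb, after_gb) pairs copied from the table above.
REQUIREMENTS_GB = {
    "Llama-3.2-3B-Instruct-xMADai-4bit": (6.5, 3.5),
    "Llama-3.2-1B-Instruct-xMADai-4bit": (2.5, 2.0),
    "Llama-3.1-405B-Instruct-xMADai-4bit": (258.14, 250.0),
    "Llama-3.1-8B-Instruct-xMADai-4bit": (16.0, 7.0),
}


def percent_saved(before_gb: float, after_gb: float) -> float:
    """Return the memory reduction as a percentage of the original requirement."""
    return 100.0 * (before_gb - after_gb) / before_gb


if __name__ == "__main__":
    for model, (before, after) in REQUIREMENTS_GB.items():
        print(f"{model}: {percent_saved(before, after):.1f}% less GPU memory")
```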

	For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.