---
library_name: transformers
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# This model has been xMADified!

This repository contains [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) quantized from 16-bit floating point to 4-bit integer weights using xMAD.ai's proprietary technology.
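As rough back-of-the-envelope arithmetic (illustrative only; the real checkpoint also stores quantization scales and keeps some tensors in higher precision), going from 16-bit to 4-bit weights cuts weight memory by about 4x:

```python
# Approximate weight memory for a ~1.24B-parameter model.
# Illustrative arithmetic only, not the exact checkpoint layout.
params = 1.24e9
fp16_gib = params * 2.0 / 2**30   # 2 bytes per weight at 16-bit
int4_gib = params * 0.5 / 2**30   # 0.5 bytes per weight at 4-bit
print(f"fp16 ~ {fp16_gib:.2f} GiB, int4 ~ {int4_gib:.2f} GiB")
```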

# How to Run the Model

Loading this xMADified model's checkpoint requires less than 2 GiB of VRAM, so it runs efficiently on most laptop GPUs.
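As a quick sanity check before loading, you can confirm that your GPU actually has about 2 GiB free (a minimal sketch using PyTorch's `torch.cuda.mem_get_info`):

```python
import torch

# Free/total device memory in bytes for the current CUDA device.
free_bytes, total_bytes = torch.cuda.mem_get_info()
free_gib = free_bytes / 2**30
print(f"Free VRAM: {free_gib:.1f} GiB")
if free_gib < 2:
    print("Warning: less than 2 GiB free; loading may fail.")
```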

**Package prerequisites**: Run the following commands to install the required packages.
```bash
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
```
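To confirm the installation before loading the model, a quick version check (a minimal sketch; the package names are the PyPI distribution names used above):

```python
from importlib.metadata import version

# Print the installed version of each prerequisite package.
for pkg in ("transformers", "accelerate", "optimum", "auto-gptq"):
    print(pkg, version(pkg))
```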

**Sample Inference Code**

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.2-1B-Instruct-xMADai-4bit"
messages = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Apply the chat template, tokenize, and move the inputs to the GPU.
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit quantized checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
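For interactive use, you can also stream tokens to stdout as they are generated. A minimal sketch reusing `model`, `tokenizer`, and `inputs` from above; `TextStreamer` is a standard `transformers` utility, and this assumes the quantized wrapper forwards `generate` keyword arguments (including `streamer`) to the underlying Hugging Face model:

```python
from transformers import TextStreamer

# Print decoded text incrementally, skipping the echoed prompt
# and any special tokens.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(**inputs, streamer=streamer, do_sample=True, max_new_tokens=256)
```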

Other xMADified models and their GPU memory requirements are listed below.

Model | GPU Memory (original → xMADified)
--- | ---
Llama-3.2-3B-Instruct-xMADai-4bit | 6.5 GB → 3.5 GB
Llama-3.2-1B-Instruct-xMADai-4bit | 2.5 GB → 2 GB
Llama-3.1-405B-Instruct-xMADai-4bit | 258.14 GB → 250 GB
Llama-3.1-8B-Instruct-xMADai-4bit | 16 GB → 7 GB

For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.