---
base_model:
- meta-llama/Llama-3.2-3B-Instruct
library_name: transformers
license: llama3.2
---

# This model has been xMADified!

This repository contains [`meta-llama/Llama-3.2-3B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) quantized from 16-bit floats to 4-bit integers using xMAD.ai's proprietary technology.
# How to Run the Model

Loading the checkpoint of this xMADified model requires less than 3 GiB of VRAM, so it runs efficiently on most laptop GPUs.
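As a rough sanity check on that figure (our back-of-the-envelope estimate, not a measurement from xMAD.ai): the 4-bit weights alone come to roughly 1.5 GiB, and quantization metadata, activations, and the KV cache account for the rest.

```python
# Illustrative estimate only (not from the model card): memory for the 4-bit
# weights of a ~3.2B-parameter model, ignoring quantization scales/zero-points,
# any higher-precision layers, activations, and the KV cache.
params = 3.2e9          # approximate parameter count of Llama-3.2-3B
bits_per_weight = 4     # 4-bit integer weights
weight_gib = params * bits_per_weight / 8 / 2**30
print(f"~{weight_gib:.1f} GiB of 4-bit weights")  # ~1.5 GiB
```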
**Package prerequisites**: Run the following commands to install the required packages.

```bash
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
```
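Optionally, a quick import check (ours, not part of the original instructions) confirms the packages installed cleanly and that a CUDA device is visible, since the sample below moves its inputs to `cuda`.

```python
# Optional environment check (illustrative, not from the model card).
import torch
import transformers
import auto_gptq  # noqa: F401  (import check only)

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```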
**Sample Inference Code**

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.2-3B-Instruct-xMADai-4bit"
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the chat-formatted input ids and move them to the GPU.
inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit GPTQ checkpoint.
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```
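Since the prerequisites also install `optimum`, recent `transformers` versions can usually load GPTQ checkpoints directly. The sketch below assumes this repository ships a `quantization_config` that transformers' GPTQ integration recognizes; the `auto_gptq` path above is the one shown on this card.

```python
# Alternative loading path (a sketch, under the assumption stated above).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "xmadai/Llama-3.2-3B-Instruct-xMADai-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What's Deep Learning?"}],
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```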
For additional xMADified models, access to fine-tuning, and general questions, please contact us at [email protected] and join our waiting list.