---
library_name: transformers
license: llama3.2
base_model:
- meta-llama/Llama-3.2-1B-Instruct
---

# This model has been xMADified!

This repository contains [`meta-llama/Llama-3.2-1B-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) quantized from 16-bit floats to 4-bit integers, using xMAD.ai proprietary technology.

# How to Run the Model

Loading the checkpoint of this xMADified model requires less than 2 GiB of VRAM, so it runs efficiently on most laptop GPUs.

**Package prerequisites**: Run the following commands to install the required packages.

```bash
pip install -q --upgrade transformers accelerate optimum
pip install -q --no-build-isolation auto-gptq
```

**Sample Inference Code**

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.2-1B-Instruct-xMADai-4bit"

# Chat-style prompt in the format expected by the tokenizer's chat template
prompt = [
    {"role": "system", "content": "You are a helpful assistant that responds as a pirate."},
    {"role": "user", "content": "What's Deep Learning?"},
]

tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer.apply_chat_template(
    prompt,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to("cuda")

# Load the 4-bit GPTQ-quantized checkpoint
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

outputs = model.generate(**inputs, do_sample=True, max_new_tokens=256)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

| Model | GPU Memory Requirement (original → xMADified) |
| --- | --- |
| Llama-3.2-3B-Instruct-xMADai-4bit | 6.5 GB → 3.5 GB |
| Llama-3.2-1B-Instruct-xMADai-4bit | 2.5 GB → 2 GB |
| Llama-3.1-405B-Instruct-xMADai-4bit | 800 GB (16 H100s) → 250 GB (8 V100s) |
| Llama-3.1-8B-Instruct-xMADai-4bit | 16 GB → 7 GB |

For additional xMADified models, access to fine-tuning, and general questions, please contact us at support@xmad.ai and join our waiting list.
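
**Optional memory check**: To confirm the memory figures in the table above on your own hardware, the following is a minimal sketch (not part of the xMAD.ai release; it assumes a CUDA GPU and the prerequisites installed above) that reports peak VRAM after one generation pass using PyTorch's built-in memory counters.

```python
import torch
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "xmadai/Llama-3.2-1B-Instruct-xMADai-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_quantized(model_id, device_map="auto", trust_remote_code=True)

# Reset the peak-memory counter, then run a short generation pass
torch.cuda.reset_peak_memory_stats()
inputs = tokenizer("What's Deep Learning?", return_tensors="pt").to("cuda")
model.generate(**inputs, max_new_tokens=64)

# Report the peak VRAM allocated during loading-plus-generation on this device
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM during generation: {peak_gib:.2f} GiB")
```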