---
base_model: [meta-llama/Meta-Llama-3.1-405B-Instruct]
---

# 🚀 CPU optimized quantizations of [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) 🖥️

This repository contains CPU-optimized GGUF quantizations of the Meta-Llama-3.1-405B-Instruct model. These quantizations are designed to run efficiently on CPU hardware while maintaining good performance.

## Available Quantizations

1. Q4_0_4_8 (CPU FMA-Optimized): ~246 GB
2. IQ4_XS (Fastest for CPU/GPU): ~212 GB
3. Q2K-Q8 Mixed quant with iMatrix: ~154 GB
4. Q2K-Q8 Mixed without iMat for testing: ~165 GB
5. 1-bit Custom per weight COHERENT quant: ~103 GB
6. BF16: ~811 GB (original model)
7. Q8_0: ~406 GB (original model)

## Use Aria2 for parallelized downloads; links will download 9x faster

>>[!TIP]🐧 On Linux `sudo apt install -y aria2`
>>
>>🍎 On Mac `brew install aria2`
>>
>>Feel free to paste these all in at once or one at a time

### Q4_0_4_8 (CPU FMA-optimized specifically for ARM server chips, NOT TESTED on x86)

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00002-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00003-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00004-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00005-of-00006.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gguf
```

### IQ4_XS Version - Fastest for CPU/GPU, should work everywhere (Size: ~212 GB)

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00003-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00003-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00004-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00004-of-00005.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00005-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00005-of-00005.gguf
```

### 1-bit Custom Per Weight Quantization (Size: ~103 GB)

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00001-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00001-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00002-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00002-of-00003.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00003-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00003-of-00003.gguf
```

### Q2K-Q8 mixed 2-bit/8-bit quant I wrote myself. This is the smallest coherent one I could make WITHOUT imatrix (Size: ~165 GB)

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-inst-cpu-2kmix8k-00004-of-00004.gguf
```

### Same as above but with higher quality iMatrix Q2K-Q8 (Size: ~154 GB) USE THIS ONE

```bash
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00001-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00001-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00002-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00002-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00003-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00003-of-00004.gguf
aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00004-of-00004.gguf
```

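
The shard names in the download commands follow a fixed pattern (`<prefix>-NNNNN-of-NNNNN.gguf`), so the per-file lines can also be generated with a small loop. The following is only a sketch: `shard_urls` and `download_shards` are hypothetical helper names, not part of this repository, and the `aria2c` options are simply the ones used in the commands in this README.

```bash
# Base URL copied from the download commands in this README.
REPO_URL="https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main"

# Hypothetical helper: print one URL per shard, e.g. <prefix>-00001-of-00003.gguf.
shard_urls() (
  prefix="$1"
  total="$2"
  i=1
  while [ "$i" -le "$total" ]; do
    printf '%s/%s-%05d-of-%05d.gguf\n' "$REPO_URL" "$prefix" "$i" "$total"
    i=$((i + 1))
  done
)

# Hypothetical helper: fetch every shard with the same aria2c options as above.
download_shards() (
  shard_urls "$1" "$2" | while IFS= read -r url; do
    # ${url##*/} strips everything up to the last slash, leaving the file name.
    aria2c -x 16 -s 16 -k 1M -o "${url##*/}" "$url" </dev/null
  done
)

# Example (commented out so that pasting this block does not start a download):
# download_shards meta-405b-cpu-imatrix-2k 4
```

Calling `shard_urls meta-405b-1bit 3` on its own prints the three URLs without downloading anything, which is a cheap way to check the prefix and shard count before starting a transfer of this size.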
Q4_0_4_8 (CPU Optimized, ~246 GB): example response to a 20,000-token prompt
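
As a rough sanity check on the sizes listed under Available Quantizations, the average number of bits stored per weight can be estimated as file size in bytes times 8, divided by the parameter count. The sketch below assumes 1 GB = 10^9 bytes and exactly 405 billion parameters, and it ignores metadata overhead; `bits_per_weight` is a hypothetical helper name.

```bash
# Rough estimate: bits per weight = size_in_bytes * 8 / parameter_count.
# Assumptions: 1 GB = 10^9 bytes, 405e9 parameters, no metadata overhead.
bits_per_weight() {
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 1e9 * 8 / 405e9 }'
}

# Approximate sizes (in GB) of the quantized variants listed in this README.
for size in 246 212 154 165 103; do
  printf '%s GB -> %s bits per weight\n' "$size" "$(bits_per_weight "$size")"
done
```

These figures are averages over the whole file, so a mixed quant can land between the nominal bit widths of its layers. The same number also gives a quick lower bound on the RAM needed just to hold the weights.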

### BF16 Version

```bash
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00001-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00001-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00002-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00002-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00003-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00003-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00004-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00004-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00005-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00005-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00006-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00006-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00007-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00007-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00008-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00008-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00009-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00009-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00010-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00010-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00011-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00011-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00012-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00012-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00013-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00013-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00014-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00014-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00015-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00015-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00016-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00016-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00017-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00017-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00018-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00018-of-00019.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-bf16-00019-of-00019.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-bf16-00019-of-00019.gguf
```

### Q8_0 Version

```bash
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00001-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00001-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00002-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00002-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00003-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00003-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00004-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00004-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00005-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00005-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00006-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00006-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00007-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00007-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00008-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00008-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00009-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00009-of-00010.gguf
aria2c -x 16 -s 16 -k 1M -o meta-llama-405b-inst-q8_0-00010-of-00010.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-llama-405b-inst-q8_0-00010-of-00010.gguf
```

## Usage

After downloading, you can use these models with tools like `llama.cpp`. Point `-m` at the first shard; `llama.cpp` picks up the remaining shards from the same directory. Here's a basic example:

```bash
./llama-cli -t 32 --temp 0.4 -fa -m ~/meow/meta-405b-inst-cpu-optimized-q4048-00001-of-00006.gguf -b 512 -c 9000 -p "Adopt the persona of a NASA JPL mathematician and friendly helpful programmer." -cnv -co -i
```

## Model Information

This model is based on the [Meta-Llama-3.1-405B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct) model. It's an instruction-tuned version of the 405B parameter Llama 3.1 model, designed for assistant-like chat and various natural language generation tasks.

Key features:

- 405 billion parameters
- Supports 8 languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai
- 128k context length
- Uses Grouped-Query Attention (GQA) for improved inference scalability

For more detailed information about the base model, please refer to the [original model card](https://huggingface.co/meta-llama/Meta-Llama-3.1-405B-Instruct).

## License

The use of this model is subject to the [Llama 3.1 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE). Please ensure you comply with the license terms when using this model.

## Acknowledgements

Special thanks to the Meta AI team for creating and releasing the Llama 3.1 model series.

## Enjoy; more quants and perplexity benchmarks coming.