nisten/meta-405b-instruct-cpu-optimized-gguf


nisten committed on Jul 24

Commit 4072371 (1 parent: 6329a82)

Update README.md


README.md +16 -7


@@ -9,10 +9,15 @@ This repository contains CPU-optimized GGUF quantizations of the Meta-Llama-3.1-
 ## Available Quantizations
 1. Q4_0_4_8 (CPU FMA-Optimized): ~246 GB
-2. BF16: ~811 GB
-3. Q8_0: ~406 GB
-4. Q2-Q8 (custom quant I wrote) ~ 165 GB
 ## Use Aria2 for parallelized downloads, links will download 9x faster
@@ -22,8 +27,7 @@ This repository contains CPU-optimized GGUF quantizations of the Meta-Llama-3.1-
 >>
 >>Feel free to paste these all in at once or one at a time
-### Q4_0_48 (CPU Optimized) Example response of 20000 token prompt:
-![image/png](https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/DD71wAB7DlQBmTG8wVaWS.png)
 ```bash
@@ -36,7 +40,7 @@ aria2c -x 16 -s 16 -k 1M -o meta-405b-inst-cpu-optimized-q4048-00006-of-00006.gg
 ```
-### IQ4_XS Version - Fastest for CPU/GPU (Size: ~212 GB)
 ```bash
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
@@ -52,7 +56,7 @@ aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00002-of-00003.gguf https://huggingfa
 aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00003-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00003-of-00003.gguf
 ```
-Note: Sizes are approximate and converted to GB (1 GB = 1024 MiB).
 ### Q2K-Q8 Mixed 2bit 8bit I wrote myself. This is the smallest coherent one I could make WITHOUT imatrix
 ```verilog
@@ -70,6 +74,11 @@ aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00003-of-00004.gguf https:/
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00004-of-00004.gguf
 ```
 ### BF16 Version
 ```bash

 ## Available Quantizations
+Available Quantizations
 1. Q4_0_4_8 (CPU FMA-Optimized): ~246 GB
+2. IQ4_XS (Fastest for CPU/GPU): ~212 GB
+3. Q2K-Q8 Mixed quant with iMatrix: ~154 GB
+4. Q2K-Q8 Mixed without iMat for testing: ~165 GB
+5. 1-bit Custom per weight COHERENT quant: ~103 GB
+6. BF16: ~811 GB (original model)
+7. Q8_0: ~406 GB (original model)
 ## Use Aria2 for parallelized downloads, links will download 9x faster
 >>
 >>Feel free to paste these all in at once or one at a time
+### Q4_0_48 (CPU FMA Optimized Specifically for ARM server chips, NOT TESTED on X86)
 ```bash
 ```
+### IQ4_XS Version - Fastest for CPU/GPU should work everywhere (Size: ~212 GB)
 ```bash
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00001-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00001-of-00005.gguf
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-i1-q4xs-00002-of-00005.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-i1-q4xs-00002-of-00005.gguf
 aria2c -x 16 -s 16 -k 1M -o meta-405b-1bit-00003-of-00003.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-1bit-00003-of-00003.gguf
 ```
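The per-shard commands above follow a regular naming pattern, so they could be generated instead of pasted one by one. The following is a minimal sketch under that assumption: the function name `print_download_cmds` is hypothetical and not part of the repository, while the base URL, the file prefix, and the `aria2c` flags are copied from the commands shown in this diff. It only prints the commands and does not download anything.

```bash
#!/bin/sh
# Sketch only: prints aria2c commands for a split GGUF download.
# BASE_URL and the aria2c flags are copied from the README commands;
# print_download_cmds is an illustrative helper, not part of the repo.
BASE_URL="https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main"

# print_download_cmds PREFIX TOTAL
# Emits one command per shard named PREFIX-0000N-of-0000T.gguf.
print_download_cmds() {
  prefix="$1"
  total="$2"
  i=1
  while [ "$i" -le "$total" ]; do
    # Shard index and shard count are both zero-padded to five digits.
    name=$(printf '%s-%05d-of-%05d.gguf' "$prefix" "$i" "$total")
    printf 'aria2c -x 16 -s 16 -k 1M -o %s %s/%s\n' "$name" "$BASE_URL" "$name"
    i=$((i + 1))
  done
}

# Print the five IQ4_XS shard commands.
print_download_cmds meta-405b-cpu-i1-q4xs 5
```

Piping the output to `sh` would run the downloads. The shard count passed as the second argument has to match the `-of-0000N` suffix used by the files in the repository (5 for the IQ4_XS files shown here).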
 ### Q2K-Q8 Mixed 2bit 8bit I wrote myself. This is the smallest coherent one I could make WITHOUT imatrix
 ```verilog
 aria2c -x 16 -s 16 -k 1M -o meta-405b-cpu-imatrix-2k-00004-of-00004.gguf https://huggingface.co/nisten/meta-405b-instruct-cpu-optimized-gguf/resolve/main/meta-405b-cpu-imatrix-2k-00004-of-00004.gguf
 ```
+<figure>
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6379683a81c1783a4a2ddba8/DD71wAB7DlQBmTG8wVaWS.png" alt="Q4_0_48 CPU Optimized example response">
+ <figcaption><strong>Q4_0_48 (CPU Optimized) (246GB):</strong> Example response of 20000 token prompt</figcaption>
+</figure>
 ### BF16 Version
 ```bash