GGUF quants for xverse/XVERSE-7B using llama.cpp

Terms of Use: Please check the original model

cthulhu

Quants

  • q2_k: Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
  • q3_k_s: Uses Q3_K for all tensors
  • q3_k_m: Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q3_k_l: Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
  • q4_0: Original quant method, 4-bit.
  • q4_1: Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
  • q4_k_s: Uses Q4_K for all tensors
  • q4_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
  • q5_0: Higher accuracy, higher resource usage and slower inference.
  • q5_1: Even higher accuracy, resource usage and slower inference.
  • q5_k_s: Uses Q5_K for all tensors
  • q5_k_m: Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
  • q6_k: Uses Q8_K for all tensors
  • q8_0: Almost indistinguishable from float16. High resource use and slow. Not recommended for most users.
Downloads last month
50
GGUF
Model size
7.3B params
Architecture
llama
Hardware compatibility
Log In to view the estimation

2-bit

3-bit

4-bit

5-bit

6-bit

8-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Collection including neopolita/xverse-7b-gguf