This repository contains 2-bit quantized LLaMA-v2 models in GGUF format for use with llama.cpp. All tensors are quantized with Q2_K, except for output.weight, which is Q6_K, and, in the case of LLaMA-v2-70B, attn_v, which is Q4_K. The quantized models differ from the standard llama.cpp 2-bit quantization in two ways:

  • These are actual 2-bit quantized models, rather than the mostly 3-bit quantization produced by the standard llama.cpp Q2_K quantization method
  • The models were prepared with a refined (but not yet published) k-quants quantization approach
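Once downloaded, these GGUF files can be run with llama.cpp like any other quantized model. A minimal sketch, assuming a built llama.cpp checkout that provides the `llama-cli` binary (older builds call it `main`) and an illustrative file name for the downloaded model:

```shell
# Run a short completion with the 2-bit model.
# The .gguf file name below is illustrative; use the file you downloaded.
./llama-cli \
    -m llama-v2-13b-q2_k.gguf \
    -p "The capital of France is" \
    -n 32    # generate at most 32 tokens
```

No re-quantization step is needed on the user's side: the per-tensor quantization mix (Q2_K, Q6_K, Q4_K) described above is already baked into the GGUF file.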
Format: GGUF
Model size: 13B params
Architecture: llama