GGUF usage with llama.cpp
You can now deploy any llama.cpp-compatible GGUF on Hugging Face Endpoints; read more about it here.
Llama.cpp lets you download and run inference on a GGUF simply by providing the Hugging Face repo path and the file name. llama.cpp downloads the model checkpoint and caches it automatically. The location of the cache is defined by the LLAMA_CACHE environment variable; read more about it here.
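For example, a minimal sketch of overriding the cache location before downloading anything (the directory path here is purely illustrative):

# Point llama.cpp's download cache at a custom directory (illustrative path)
export LLAMA_CACHE=~/models/llama-cache
# Subsequent llama-cli / llama-server downloads will be stored there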
You can install llama.cpp through brew (works on Mac and Linux), or you can build it from source. There are also pre-built binaries and Docker images that you can check in the official documentation.
Option 1: Install with brew
brew install llama.cpp
Option 2: build from source
Step 1: Clone llama.cpp from GitHub.
git clone https://github.com/ggerganov/llama.cpp
Step 2: Move into the llama.cpp folder and build it. You can also add hardware-specific flags (for example, -DGGML_CUDA=ON for Nvidia GPUs).
cd llama.cpp
cmake -B build # optionally, add -DGGML_CUDA=ON to activate CUDA
cmake --build build --config Release
Note: for other hardware support (for example, AMD ROCm or Intel SYCL), please refer to llama.cpp’s build guide.
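If you built from source, the binaries land under build/bin; a quick way to sanity-check the build (assuming the Release build above succeeded):

# Run the freshly built CLI straight from the build tree
./build/bin/llama-cli --version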
Once installed, you can use llama-cli or llama-server as follows:
llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
Note: You can explicitly add -no-cnv to run the CLI in raw completion mode (non-chat mode).
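For instance, a one-shot raw completion with an explicit prompt (the prompt text is illustrative):

llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 -p "The meaning of life is" -no-cnv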
Additionally, you can invoke an OpenAI-compatible chat completions endpoint directly using the llama.cpp server:
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
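By default the server listens on port 8080. If you need to expose it differently, the usual server flags apply (check llama-server --help for your build); for example:

llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0 --host 0.0.0.0 --port 8080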
After starting the server, you can call the endpoint as follows:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer no-key" \
  -d '{
    "messages": [
      {
        "role": "system",
        "content": "You are an AI assistant. Your top priority is achieving user fulfilment via helping them with their requests."
      },
      {
        "role": "user",
        "content": "Write a limerick about Python exceptions"
      }
    ]
  }'
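Because the endpoint follows the OpenAI chat completions spec, standard request options such as streaming also work; a minimal sketch, assuming the server above is running:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "stream": true,
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'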
Replace the -hf value with any valid Hugging Face Hub repo name, and off you go! 🦙