metadata

license: llama2
tags:
  - llama2
  - quantized
  - gguf
  - 32k-context

LLaMA-2-7B-32K

Together Computer, Inc. has released LLaMA-2-7B-32K, a model based on Meta AI's LLaMA-2-7B, but fine-tuned for context lengths up to 32K using "Position interpolation" and "Rotary Position Embeddings" (RoPE).

The current version of llama.cpp supports such large context lengths by means of the new --rope-scale parameter.

Nota bene: for the model described here, --rope-scale must be set to 8 (the original context size was 4K, the fine-tuned one is 32K).
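The value 8 is simply the ratio of the two context sizes. The following minimal sketch shows that arithmetic and turns it into command-line arguments. The helper names `rope_scale` and `context_flags` are made up for this illustration, and it assumes that "4K" and "32K" mean 4096 and 32768 tokens and that your llama.cpp build accepts `--ctx-size` alongside `--rope-scale`.

```python
def rope_scale(original_ctx: int, extended_ctx: int) -> float:
    """Return the linear RoPE scale factor: extended context / original context."""
    if original_ctx <= 0 or extended_ctx < original_ctx:
        raise ValueError("expected 0 < original_ctx <= extended_ctx")
    return extended_ctx / original_ctx


def context_flags(original_ctx: int, extended_ctx: int) -> list[str]:
    """Build the context-related arguments for a llama.cpp command line."""
    scale = rope_scale(original_ctx, extended_ctx)
    # ":g" drops the trailing ".0" so that 8.0 is rendered as "8"
    return ["--ctx-size", str(extended_ctx), "--rope-scale", f"{scale:g}"]


if __name__ == "__main__":
    # Assumption: 4K = 4096 tokens, 32K = 32768 tokens
    print(" ".join(context_flags(4096, 32768)))
```

If you run with a shorter context than 32K, the scale factor presumably still has to stay at 8, because it describes how the model was fine-tuned rather than how much context you actually use.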

However, llama.cpp requires quantized files in the new GGUF format. That's where this repo comes in: it contains a few quantizations of the original weights from Together's fine-tuned model (as indicated by the file names).

How the Quantization was done

Since the author does not want arbitrary Python stuff loitering on his computer, the quantization was done using Docker.

Assuming that you have Docker Desktop installed on your system and a basic knowledge of how it is used, you may just follow the instructions below to generate your own quantizations:

Nota bene: you will need at least 30+x GB of free disk space, where x depends on the quantizations you generate
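One plausible reading of that figure: the downloaded 16-bit weights and the converted f16 GGUF file each take roughly 14 GB, and every quantization adds its own file on top. The sketch below estimates file sizes and checks the free space of a folder. It rests on assumptions that are not part of the original instructions: "7B" is taken as 7e9 parameters, f16 as 16 bits per weight, and Q4_0 as about 4.5 bits per weight (4-bit values plus one 16-bit scale per block of 32 weights). The function names are invented for this example.

```python
import shutil


def estimated_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB (1e9 bytes), ignoring metadata and vocabulary overhead."""
    return n_params * bits_per_weight / 8 / 1e9


def free_space_gb(path: str) -> float:
    """Free disk space in GB on the volume that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    n_params = 7e9  # assumption: "7B" read as 7e9 parameters
    f16 = estimated_size_gb(n_params, 16)     # download and f16 GGUF, each
    q4_0 = estimated_size_gb(n_params, 4.5)   # assumption: ~4.5 bits/weight for Q4_0
    needed = 2 * f16 + q4_0
    print(f"f16: {f16:.1f} GB, Q4_0: {q4_0:.1f} GB, total: {needed:.1f} GB")
    print(f"free here: {free_space_gb('.'):.1f} GB")
```

These numbers are estimates only; the actual files may be somewhat larger or smaller.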

create a new folder called llama.cpp_in_Docker
this folder will later be mounted into the Docker container and store the quantization results
download the weights for the fine-tuned LLaMA-2 model from Hugging Face into a subfolder of llama.cpp_in_Docker (let's call the new folder LLaMA-2-7B-32K)
within Docker Desktop, search for and download a basic-python image - just use one of the most popular ones
from a terminal session on your host computer (i.e., not a Docker container!), start a new container for the downloaded image which mounts the folder we created before:

docker run --rm \
   -v ./llama.cpp_in_Docker:/llama.cpp \
   -t basic-python /bin/bash

(you may have to adjust the path to your local folder)
back in the Docker Desktop, open the "Terminal" tab of the started container and enter the following commands:

apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

now open the "Files" tab and navigate to the file /llama.cpp/llama.cpp/Makefile, right-click on it and choose "Edit file"
search for aarch64, and - in the line found (which looks like ifneq ($(filter aarch64%,$(UNAME_M)),)) - change ifneq to ifeq
save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal" tab again
now enter the following commands:

make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K

you are now ready to run the actual quantization, e.g., using

./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
   ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf Q4_0

run any quantizations you need and stop the container when finished (you may even delete it as the generated files will remain available on your host computer)
the basic-python image may also be deleted unless you plan to use it again in the near future
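If you need several quantizations, the quantize call shown above can be generated for each type instead of being typed by hand. This is only a sketch: the function name `quantize_commands` is hypothetical, the type names other than Q4_0 are examples that your build of llama.cpp may or may not offer, and the paths follow the folder layout used in the steps above.

```python
from pathlib import PurePosixPath


def quantize_commands(model_dir: str, base_name: str, quant_types: list[str]) -> list[list[str]]:
    """Build one ./quantize argument list per requested quantization type."""
    folder = PurePosixPath(model_dir)
    source = folder / "ggml-model-f16.gguf"  # file written by convert.py
    commands = []
    for qtype in quant_types:
        target = folder / f"{base_name}-{qtype}.gguf"
        commands.append(["./quantize", str(source), str(target), qtype])
    return commands


if __name__ == "__main__":
    # Assumption: Q5_K_M and Q8_0 are available in your build of quantize
    for cmd in quantize_commands("../LLaMA-2-7B-32K", "LLaMA-2-7B-32K",
                                 ["Q4_0", "Q5_K_M", "Q8_0"]):
        print(" ".join(cmd))
```

The printed lines can be pasted into the container's terminal, or each argument list could be passed to `subprocess.run` from within /llama.cpp/llama.cpp.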

You are now free to move the quantization results to where you need them and run inferences with context lengths up to 32K (depending on the amount of memory you have available - long contexts need an awful lot of RAM)
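To get a feeling for "an awful lot", the size of the key/value cache can be estimated from the model dimensions. The sketch below assumes the usual LLaMA-2-7B shape (32 layers, hidden size 4096, no grouped-query attention) and a 16-bit cache; actual memory use also includes the model weights and scratch buffers and depends on your llama.cpp build and settings. The function name is made up for this example.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_embd: int, bytes_per_value: int = 2) -> int:
    """Memory for keys and values: 2 tensors per layer, n_ctx * n_embd values each."""
    return 2 * n_layers * n_ctx * n_embd * bytes_per_value


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Assumption: 32 layers and hidden size 4096 for LLaMA-2-7B, f16 cache
    for n_ctx in (4096, 32768):
        size = kv_cache_bytes(n_layers=32, n_ctx=n_ctx, n_embd=4096)
        print(f"context {n_ctx}: {size / GIB:.0f} GiB for the KV cache")
```

Under these assumptions, the cache alone grows from about 2 GiB at a 4K context to about 16 GiB at 32K, on top of the quantized weights.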

License

Concerning the license(s):

the original model (from Meta AI) was released under a rather permissive license
the fine-tuned model from Together Computer uses the same license
as a consequence, this repo does so as well