---
license: llama2
tags:
- llama2
- quantized
- gguf
- 32k-context
---

# LLaMA-2-7B-32K #

[Together Computer, Inc.](https://together.ai/) has released
[LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K), a model based on Meta AI's LLaMA-2-7B,
but fine-tuned for context lengths up to 32K using "Position Interpolation" and "Rotary Position Embeddings"
(RoPE).

The current version of [llama.cpp](https://github.com/ggerganov/llama.cpp) supports such large context lengths
by means of the new [`--rope-scale`](https://github.com/ggerganov/llama.cpp/tree/master/examples/main#extended-context-size)
parameter.

> Nota bene: for the model described here, `--rope-scale` has to be set to `8` (the original context size
> was 4k, the fine-tuned one is 32k)

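The value `8` is simply the ratio of the fine-tuned context size to the original one. A minimal shell sketch of that arithmetic follows; the variable names are illustrative only and not part of llama.cpp, and "4k" and "32k" are taken to mean 4096 and 32768 tokens:

```shell
# Context sizes in tokens (4k original, 32k fine-tuned, as stated above)
ORIGINAL_CTX=4096
FINETUNED_CTX=32768

# Scaling factor = fine-tuned context size / original context size
ROPE_SCALE=$(( FINETUNED_CTX / ORIGINAL_CTX ))

# Print the flag as it would be passed to llama.cpp
echo "--rope-scale $ROPE_SCALE"
```
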
However, llama.cpp requires model files in the new GGUF format - that's where this repo comes in:
it contains a few quantizations of the original weights from Together's fine-tuned model (as indicated by
the file names).

## How the Quantization was done ##

Since the author does not want arbitrary Python stuff loitering on his computer, the quantization was done
using [Docker](https://www.docker.com/).

Assuming that you have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on
your system and also have a basic knowledge of how it is used, you may just follow the instructions shown
below in order to generate your own quantizations:

> Nota bene: you will need at least 30+x GB of free disk space, where x depends on the quantizations you generate

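Before starting, the free space on the target volume can be checked from a terminal. The sketch below assumes a POSIX `df` and `awk`; the helper name `free_gb` and the variable names are made up for illustration, and the 30 GB threshold is only the lower bound from the note above:

```shell
# Print the free space (in whole GB) of the file system holding the given path
free_gb() {
  # -P: portable one-line-per-file-system output, -k: sizes in 1024-byte blocks
  # column 4 of the second line is the available space
  df -Pk "$1" | awk 'NR==2 { print int($4 / 1024 / 1024) }'
}

REQUIRED_GB=30            # lower bound; the quantized files come on top of this
AVAILABLE_GB=$(free_gb .) # check the current folder

if [ "$AVAILABLE_GB" -lt "$REQUIRED_GB" ]; then
  echo "only ${AVAILABLE_GB} GB free, at least ${REQUIRED_GB} GB needed"
else
  echo "${AVAILABLE_GB} GB free"
fi
```

Run it from the folder that will hold `llama.cpp_in_Docker`, since that is where the downloaded weights and all generated files end up.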
1. create a new folder called `llama.cpp_in_Docker`<br>this folder will later be mounted into the Docker
container and store the quantization results
2. download the weights for the fine-tuned LLaMA-2 model from
[Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) into a subfolder of `llama.cpp_in_Docker`
(let's call the new folder `LLaMA-2-7B-32K`)
3. within <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
for the downloaded image which mounts the folder we created before:<br> <br>`docker run --rm -v ./llama.cpp_in_Docker:/llama.cpp -t basic-python /bin/bash`<br> <br>(you may have to adjust the path to your local folder)
5. back in <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
following commands:<br>
```
# refresh the package lists and install the build tools
apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
# fetch llama.cpp into the mounted folder
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
6. now open the "Files" tab and navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
choose "Edit file"
7. search for `aarch64`, and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) -
change `ifneq` to `ifeq`
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
tab again
9. now enter the following commands:<br>
```
# build llama.cpp, then convert the downloaded weights into a 16-bit GGUF file
make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K
```
10. you are now ready to run the actual quantization, e.g., using<br>
```
./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
  ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf Q4_0
```
11. run any quantizations you need and stop the container when finished (you may even delete it as the
generated files will remain available on your host computer)
12. the `basic-python` image may also be deleted unless you plan to use it again in the near future

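Steps 6 to 8 and steps 10 to 11 can also be carried out from the container's terminal instead of the Docker Desktop editor. The following is only a sketch under a few assumptions: the function names `patch_makefile` and `quantize_all` are made up for illustration, the `sed` expression assumes that the `aarch64` line starts with `ifneq` exactly as quoted in step 7, and `Q5_0` and `Q8_0` are merely examples of further quantization types.

```shell
# Steps 6-8: change "ifneq" to "ifeq" on every line that mentions aarch64
# and starts with "ifneq"; all other lines are left untouched
patch_makefile() {
  makefile=$1
  sed '/aarch64/ s/^ifneq/ifeq/' "$makefile" > "$makefile.tmp" &&
    mv "$makefile.tmp" "$makefile"
}

# Steps 10-11: run one quantization per given type
# usage: quantize_all <model folder> <type> [<type> ...]
# if DRY_RUN is set to a non-empty value, the commands are only printed
quantize_all() {
  model_dir=$1
  shift
  for qtype in "$@"; do
    ${DRY_RUN:+echo} ./quantize "$model_dir/ggml-model-f16.gguf" \
      "$model_dir/LLaMA-2-7B-32K-$qtype.gguf" "$qtype"
  done
}

# Example, from /llama.cpp/llama.cpp inside the container:
#   patch_makefile Makefile                        # before "make" in step 9
#   quantize_all ../LLaMA-2-7B-32K Q4_0 Q5_0 Q8_0  # after step 9
```

It is worth running `grep -n aarch64 Makefile` first to confirm that only the intended line will be changed.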
You are now free to move the quantization results to where you need them and run inferences with context
lengths up to 32K (depending on the amount of memory you have available - long contexts need an awful
lot of RAM).

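A sketch of such an inference call with llama.cpp's `main` example program: it assumes the `-m`, `-c` and `-p` options of `main` together with the `--rope-scale` parameter mentioned above, while the helper name `run_32k`, the model path and the prompt are placeholders. Passing a smaller context size reduces memory use if 32K tokens do not fit into RAM.

```shell
# Build (and, unless DRY_RUN is set, run) a llama.cpp call for the 32K model
# usage: run_32k <model file> <prompt> [<context size in tokens>]
run_32k() {
  model=$1          # path to a quantized .gguf file
  prompt=$2         # prompt text
  ctx=${3:-32768}   # context size, defaults to the full 32k
  # --rope-scale 8 = 32k / 4k, as explained near the top of this document
  ${DRY_RUN:+echo} ./main -m "$model" -c "$ctx" --rope-scale 8 -p "$prompt"
}

# Example (placeholder file name and prompt):
#   run_32k ./LLaMA-2-7B-32K-Q4_0.gguf "Once upon a time" 16384
```
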
## License ##

Concerning the license(s):

* the [original model](https://ai.meta.com/llama/) (from Meta AI) was released under a rather [permissive
license](https://ai.meta.com/llama/license/)
* the fine-tuned model from Together Computer uses the
[same license](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/README.md)
* as a consequence, this repo does so as well