---
license: llama2
tags:
- llama2
- quantized
- gguf
- 32k-context
---

# LLaMA-2-7B-32K #

[Together Computer, Inc.](https://together.ai/) has released
[LLaMA-2-7B-32K](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K), a model based on Meta AI's LLaMA-2-7B,
but fine-tuned for context lengths up to 32K using "Position interpolation" and "Rotary Position Embeddings"
(RoPE).

The current version of [llama.cpp](https://github.com/ggerganov/llama.cpp) supports such large context lengths
by means of the new [`--rope-scale`](https://github.com/ggerganov/llama.cpp/tree/master/examples/main#extended-context-size)
parameter.

> Nota bene: for the model described here, `--rope-scale` has to be set to `8` (the original context size
> was 4K, the fine-tuned one is 32K, and 32K / 4K = 8)
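
The scale factor is simply the ratio of the two context sizes. The following sketch derives it and prints a
matching command line. It assumes the `main` binary built from llama.cpp and its `-m`, `-c`, `-f` and
`--rope-scale` options; the model path and `prompt.txt` are placeholders, not files shipped with this repo.

```shell
ORIG_CTX=4096      # context size of the original LLaMA-2-7B
TARGET_CTX=32768   # context size after Together's fine-tuning
ROPE_SCALE=$((TARGET_CTX / ORIG_CTX))   # 32768 / 4096

# placeholder paths - adjust to wherever your quantized file and prompt live
MODEL="./LLaMA-2-7B-32K-Q4_0.gguf"
PROMPT_FILE="./prompt.txt"

# only print the command, so you can review it before running it yourself
echo "./main -m $MODEL -c $TARGET_CTX --rope-scale $ROPE_SCALE -f $PROMPT_FILE"
```

If you want a shorter context in order to save memory, lower `-c` only and keep `--rope-scale` at `8`, since
the factor describes how the model was fine-tuned.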

However, llama.cpp requires quantized files in the new GGUF format - that's where this repo comes in:
it contains a few quantizations of the original weights from Together's fine-tuned model (as indicated by
the file names).

## How the Quantization was done ##

Since the author does not want arbitrary Python stuff loitering on his computer, the quantization was done
using [Docker](https://www.docker.com/).

Assuming that you have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on
your system and also have a basic knowledge of how it is used, you may just follow the instructions shown
below in order to generate your own quantizations:

> Nota bene: you will need 30+x GB of free disk space, at least - depending on your quantization

1. create a new folder called `llama.cpp_in_Docker`<br>this folder will later be mounted into the Docker
container and store the quantization results
2. download the weights for the fine-tuned LLaMA-2 model from
[Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) into a subfolder of `llama.cpp_in_Docker`
(let's call the new folder `LLaMA-2-7B-32K`)
3. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
for the downloaded image which mounts the folder we created before:<br>&nbsp;<br>`docker run --rm \
  -v ./llama.cpp_in_Docker:/llama.cpp \
  -t basic-python /bin/bash`<br>&nbsp;<br>(you may have to adjust the path to your local folder)
5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
following commands:<br>
```
apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
6. now open the "Files" tab and navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
choose "Edit file"
7. search for `aarch64`, and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) - 
change `ifneq` to `ifeq`
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
tab again
9. now enter the following commands:<br>
```
make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K
```
10. you are now ready to run the actual quantization, e.g., using<br>
```
./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
   ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf Q4_0
```
11. run any quantizations you need and stop the container when finished (you may even delete it as the generated files
will remain available on your host computer)
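
Steps 10 and 11 can also be combined into a small loop. This is only a sketch: the list of quantization types
is an example selection (not necessarily the ones offered in this repo), and the script merely prints the
commands unless it finds the `quantize` binary and the converted `ggml-model-f16.gguf` file from step 9.

```shell
MODEL_DIR="../LLaMA-2-7B-32K"           # folder from step 2
SRC="$MODEL_DIR/ggml-model-f16.gguf"    # output of convert.py (step 9)
TYPES="Q4_0 Q5_K_M Q8_0"                # example selection - adjust as needed

PLANNED=""
for TYPE in $TYPES; do
  OUT="$MODEL_DIR/LLaMA-2-7B-32K-$TYPE.gguf"
  PLANNED="$PLANNED $OUT"
  if [ -x ./quantize ] && [ -f "$SRC" ]; then
    # run from within /llama.cpp/llama.cpp, as in step 10
    ./quantize "$SRC" "$OUT" "$TYPE"
  else
    # nothing to work with here, so just show what would be done
    echo "would run: ./quantize $SRC $OUT $TYPE"
  fi
done
```

Every run writes a separate file, so make sure there is enough free disk space for all the types you select.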

You are now free to move the quantization results to where you need them and run inference with context
lengths up to 32K (depending on the amount of memory you have available - long contexts need an awful
lot of RAM).
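
To get a feeling for "an awful lot", here is a back-of-the-envelope estimate of the key/value cache alone. It
uses the LLaMA-2-7B dimensions (32 layers, hidden size 4096) and assumes 16-bit cache entries; actual memory
usage depends on your llama.cpp build and settings, and the model weights come on top.

```shell
N_LAYER=32     # transformer layers in LLaMA-2-7B
N_EMBD=4096    # hidden size of LLaMA-2-7B
N_CTX=32768    # full 32K context
BYTES=2        # assumption: 16-bit (f16) cache entries

# factor 2: one tensor for the keys and one for the values, per layer
KV_BYTES=$((2 * N_LAYER * N_CTX * N_EMBD * BYTES))
KV_GIB=$((KV_BYTES / 1024 / 1024 / 1024))

echo "KV cache at $N_CTX tokens: about $KV_GIB GiB (plus the model weights)"
```

The estimate grows linearly with `N_CTX`, so halving the context length halves the cache.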

## License ##

Concerning the license(s):

* the [original model](https://ai.meta.com/llama/) (from Meta AI) was released under a rather [permissive
license](https://ai.meta.com/llama/license/)
* the fine-tuned model from Together Computer uses the
[same license](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K/blob/main/README.md)
* as a consequence, this repo does so as well