---
license: llama2
tags:
- llama
- llama-2
- facebook
- meta
- text-generation-inference
- quantized
- gguf
- 32k-context
- togethercomputer
language:
- en
pipeline_tag: text-generation
---

# LLaMA-2-7B-32K-Instruct_GGUF #

[Together Computer, Inc.](https://together.ai/) has released
[Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), a model based on
[Meta AI](https://ai.meta.com)'s [LLaMA-2-7B](https://huggingface.co/meta-llama/Llama-2-7b),
but fine-tuned for context lengths up to 32K using "Position Interpolation" and "Rotary Position Embeddings"
(RoPE).

While the current version of [llama.cpp](https://github.com/ggerganov/llama.cpp) already supports such large
context lengths, it requires quantized files in the new GGUF format - and that's where this repo comes in:
it contains the following quantizations of the original weights from Together's fine-tuned model:

* [Q2_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q2_K.gguf)
* [Q3_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_S.gguf),
  [Q3_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf) (aka Q3_K) and
  [Q3_K_L](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_L.gguf)
* ~~[Q4_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_0.gguf)~~,
  ~~[Q4_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_1.gguf)~~,
  ~~[Q4_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_S.gguf)~~ and
  ~~[Q4_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_M.gguf)~~ (aka Q4_K)
* ~~[Q5_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_0.gguf)~~,
  ~~[Q5_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_1.gguf)~~,
  ~~[Q5_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_S.gguf)~~ and
  ~~[Q5_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_M.gguf)~~ (aka Q5_K)
* ~~[Q6_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q6_K.gguf)~~
* ~~[Q8_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q8_0.gguf)~~
* ~~[F16](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-f16.gguf)~~ (unquantized)

> Nota bene: while RoPE makes inferences with large contexts possible, you still need an awful lot of RAM
> when doing so. And since "32K" does not mean that you always have to use a context size of 32768 (only that
> the model was fine-tuned for that size), it is recommended that you keep your context as small as possible.

> If you need quantizations for Together Computer's
> [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K)
> model, then look for
> [LLaMA-2-7B-32K_GGUF](https://huggingface.co/rozek/LLaMA-2-7B-32K_GGUF)

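For orientation, here is a minimal sketch of how one of these files might be used with llama.cpp's `main`
example once it has been built. The exact flags, and the `[INST] ... [/INST]` prompt template assumed here
for the Instruct fine-tune, may differ between llama.cpp versions - check `./main --help` and Together's
model card before relying on them:

```
# download one of the quantizations from this repo (Q3_K_M taken as an example)
curl -L -o LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf \
  https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/resolve/main/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf

# run an inference with a deliberately modest context size (4096 instead of the full 32768)
./main -m LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf \
  -c 4096 -n 256 \
  -p $'[INST]\nExplain what RoPE scaling is in two sentences.\n[/INST]\n\n'
```
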
## How Quantization was done ##

Since the author does not want arbitrary Python stuff to loiter on his computer, the quantization was done
using [Docker](https://www.docker.com/).

Assuming that you have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on
your system and also have a basic knowledge of how to use it, you may just follow the instructions shown
below in order to generate your own quantizations:

> Nota bene: you will need at least 30 GB of free disk space - more, depending on which quantizations you generate

1. create a new folder called `llama.cpp_in_Docker`<br>this folder will later be mounted into the Docker
   container and store the quantization results
2. download the weights for the fine-tuned LLaMA-2 model from
   [Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct) into a subfolder of
   `llama.cpp_in_Docker` (let's call the new folder `LLaMA-2-7B-32K-Instruct`) - one command-line way of
   doing this is sketched after this list
3. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
   the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
   for the downloaded image which mounts the folder we created before:<br>
   ```
   docker run --rm \
     -v ./llama.cpp_in_Docker:/llama.cpp \
     -t basic-python /bin/bash
   ```
   (you may have to adjust the path to your local folder)
5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
   following commands (one after the other - copying the complete list and pasting it into the terminal
   as a whole does not always seem to work properly):<br>
   ```
   apt update
   apt-get install software-properties-common -y
   apt-get update
   apt-get install g++ git make -y
   cd /llama.cpp
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   ```
6. now open the "Files" tab and navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
   choose "Edit file"
7. search for `aarch64`, and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) -
   change `ifneq` to `ifeq`
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
   tab again
9. now enter the following commands:<br>
   ```
   make
   python3 -m pip install -r requirements.txt
   python3 convert.py ../LLaMA-2-7B-32K-Instruct
   ```
10. you are now ready to run the actual quantization, e.g., using<br>
    ```
    ./quantize ../LLaMA-2-7B-32K-Instruct/ggml-model-f16.gguf \
      ../LLaMA-2-7B-32K-Instruct/LLaMA-2-7B-32K-Instruct-Q4_0.gguf Q4_0
    ```
11. run any quantizations you need and stop the container when finished (the container will automatically
    be deleted, but the generated files will remain available on your host computer)
12. the `basic-python` image may also be deleted (manually) unless you plan to use it again in the near future

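As referenced in step 2, here is one possible host-side sketch for fetching the original weights and - if you
prefer the command line over the Docker Desktop UI - for pulling a Python base image. It assumes `git`,
`git-lfs` and the `docker` CLI are installed, and `python:3.11` merely stands in for whatever `basic-python`
image you chose:

```
# step 2: clone the original weights into a subfolder of llama.cpp_in_Docker
cd llama.cpp_in_Docker
git lfs install
git clone https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct LLaMA-2-7B-32K-Instruct

# step 3 (optional CLI alternative): pull a Python base image instead of searching in the Docker Desktop
docker pull python:3.11
```
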
You are now free to move the quantization results to where you need them and run inferences with context
lengths up to 32K (depending on the amount of memory you will have available - long contexts need a
lot of RAM).

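To get a rough feeling for why long contexts are so memory-hungry, here is a back-of-the-envelope estimate of
the KV cache alone for a 7B LLaMA-2 model (32 layers, 4096 embedding dimensions), assuming the cache is kept
in f16 - the numbers are an approximation only, and the model weights and working buffers come on top:

```
# KV cache ≈ 2 (K and V) × layers × embedding size × context length × bytes per element
# at the full 32K context: 2 × 32 × 4096 × 32768 × 2 bytes ≈ 16 GiB
echo $(( 2 * 32 * 4096 * 32768 * 2 / 1024**3 ))   # prints 16 (GiB)
```
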
## License ##

Concerning the license(s):

* the [original model](https://ai.meta.com/llama/) (from Meta AI) was released under a rather [permissive
  license](https://ai.meta.com/llama/license/)
* the fine-tuned model from Together Computer uses the
  [same license](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct/blob/main/README.md)
* as a consequence, this repo does so as well