rozek committed
Commit 0b71789
1 Parent(s): c8d570c

Update README.md

Files changed (1):
1. README.md +125 -0
README.md CHANGED
---
license: llama2
tags:
- llama
- llama-2
- facebook
- meta
- text-generation-inference
- quantized
- gguf
- 32k-context
- togethercomputer
language:
- en
pipeline_tag: text-generation
---

# LLaMA-2-7B-32K-Instruct_GGUF #

[Together Computer, Inc.](https://together.ai/) has released
[Llama-2-7B-32K-Instruct](https://huggingface.co/togethercomputer/Llama-2-7B-32K-Instruct), a model based on
[Meta AI](https://ai.meta.com)'s [LLaMA-2-7B](https://huggingface.co/meta-llama/Llama-2-7b),
but fine-tuned for context lengths up to 32K using "Position Interpolation" and "Rotary Position Embeddings"
(RoPE).
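
Roughly speaking, RoPE rotates the queries and keys of a token at position $m$ by angles $m \cdot \theta_i$
(one $\theta_i$ per pair of embedding dimensions), and "Position Interpolation" rescales those positions
into the range the base model was trained on,

$$m \;\longmapsto\; m \cdot \frac{4096}{32768} = \frac{m}{8}$$

so that a 32K-token context re-uses the angle range the 4K base model already handles well - see the
original "Position Interpolation" paper for the precise formulation.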

While the current version of [llama.cpp](https://github.com/ggerganov/llama.cpp) already supports such large
context lengths, it requires quantized files in the new GGUF format - and that's where this repo comes in:
it contains the following quantizations of the original weights from Together's fine-tuned model:

* [Q2_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q2_K.gguf)
* [Q3_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_S.gguf),
[Q3_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf) (aka Q3_K) and
[Q3_K_L](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q3_K_L.gguf)
* ~~[Q4_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_0.gguf)~~,
~~[Q4_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_1.gguf)~~,
~~[Q4_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_S.gguf)~~ and
~~[Q4_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q4_K_M.gguf)~~ (aka Q4_K)
* ~~[Q5_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_0.gguf)~~,
~~[Q5_1](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_1.gguf)~~,
~~[Q5_K_S](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_S.gguf)~~ and
~~[Q5_K_M](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q5_K_M.gguf)~~ (aka Q5_K)
* ~~[Q6_K](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q6_K.gguf)~~
* ~~[Q8_0](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-Q8_0.gguf)~~
* ~~[F16](https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/blob/main/LLaMA-2-7B-32K-Instruct-f16.gguf)~~ (unquantized)
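
Each of the available files can also be fetched from the command line via Hugging Face's standard `resolve`
URLs - for example (the target folder `Models` is just a placeholder of this example, pick whatever you like):

```
wget -P Models \
  https://huggingface.co/rozek/LLaMA-2-7B-32K-Instruct_GGUF/resolve/main/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf
```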

> Nota bene: while RoPE makes inference with large contexts possible, you still need an awful lot of RAM
> when doing so. And since "32K" does not mean that you always have to use a context size of 32768 (only that
> the model was fine-tuned for that size), it is recommended that you keep your context as small as possible.

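As an illustration of this advice, such a quantization could be run with llama.cpp's example `main` program
and a deliberately modest context size - the model path, prompt and the chosen `-c 4096` below are
placeholders of this example, not prescriptions:

```
./main -m Models/LLaMA-2-7B-32K-Instruct-Q3_K_M.gguf \
  -c 4096 -n 256 \
  -p "[INST] Please summarize what GGUF files are. [/INST]"
```

(the `[INST] ... [/INST]` wrapping is the instruction format described in Together's model card - have a
look there if the responses seem off)
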
> If you need quantizations for Together Computer's
> [Llama-2-7B-32K](https://huggingface.co/togethercomputer/Llama-2-7B-32K)
> model, then look for
> [LLaMA-2-7B-32K_GGUF](https://huggingface.co/rozek/LLaMA-2-7B-32K_GGUF)

## How Quantization was done ##

Since the author does not want arbitrary Python stuff to loiter on his computer, the quantization was done
using [Docker](https://www.docker.com/).

Assuming that you have [Docker Desktop](https://www.docker.com/products/docker-desktop/) installed on
your system and know the basics of how to use it, you can follow the instructions below to generate
your own quantizations:

> Nota bene: you will need at least 30 GB of free disk space - more, depending on which quantizations you generate.

1. create a new folder called `llama.cpp_in_Docker`<br>this folder will later be mounted into the Docker
container and store the quantization results
2. download the weights for the fine-tuned LLaMA-2 model from
[Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct) into a subfolder of
`llama.cpp_in_Docker` (let's call the new folder `LLaMA-2-7B-32K-Instruct`)
3. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
for the downloaded image which mounts the folder we created before:<br>
```
docker run --rm \
  -v ./llama.cpp_in_Docker:/llama.cpp \
  -t basic-python /bin/bash
```

(you may have to adjust the path to your local folder)

5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
following commands (one after the other - copying the complete list and pasting it into the terminal
as a whole does not always seem to work properly):<br>
```
apt update
apt-get install software-properties-common -y
apt-get update
apt-get install g++ git make -y
cd /llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
```
6. now open the "Files" tab and navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
choose "Edit file"
7. search for `aarch64`, and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) -
change `ifneq` to `ifeq` (a `sed` one-liner that performs the same edit is shown after this list)
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
tab again
9. now enter the following commands:<br>
```
make
python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K-Instruct
```
10. you are now ready to run the actual quantization, e.g., using<br>
```
./quantize ../LLaMA-2-7B-32K-Instruct/ggml-model-f16.gguf \
  ../LLaMA-2-7B-32K-Instruct/LLaMA-2-7B-32K-Instruct-Q4_0.gguf Q4_0
```
11. run any quantizations you need and stop the container when finished (the container will automatically
be deleted but the generated files will remain available on your host computer)
12. the `basic-python` image may also be deleted (manually) unless you plan to use it again in the near future
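
The Makefile tweak from step 7 can also be applied from the container's terminal instead of the built-in
editor - the following one-liner is equivalent to the manual edit (assuming the Makefile still contains a
single line matching `filter aarch64`):

```
sed -i '/filter aarch64/s/^ifneq/ifeq/' Makefile
```

For convenience, the container-side commands can also be collected into a single script. The following
sketch merely mirrors steps 5-10 above and has not been tested in this exact form; the image name
`python:3.11`, the script name `quantize_in_docker.sh` and the `Q4_0` example are assumptions of this
sketch, not part of the original procedure:

```
#!/bin/bash
# quantize_in_docker.sh - hypothetical consolidation of steps 5-10 above
set -e                                      # stop on the first error

apt-get update
apt-get install -y g++ git make             # build tools for llama.cpp

cd /llama.cpp                               # the folder mounted from the host
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# same workaround as step 7: force the generic (non-aarch64) build path
sed -i '/filter aarch64/s/^ifneq/ifeq/' Makefile
make

python3 -m pip install -r requirements.txt
python3 convert.py ../LLaMA-2-7B-32K-Instruct      # HF weights -> ggml-model-f16.gguf
./quantize ../LLaMA-2-7B-32K-Instruct/ggml-model-f16.gguf \
           ../LLaMA-2-7B-32K-Instruct/LLaMA-2-7B-32K-Instruct-Q4_0.gguf Q4_0
```

Placed into `llama.cpp_in_Docker`, such a script could then be run non-interactively with something like

```
docker run --rm \
  -v "$PWD/llama.cpp_in_Docker":/llama.cpp \
  python:3.11 /bin/bash /llama.cpp/quantize_in_docker.sh
```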

You are now free to move the quantization results to wherever you need them and run inference with context
lengths of up to 32K (depending on the amount of memory you have available - long contexts need a
lot of RAM).

## License ##

Concerning the license(s):

* the [original model](https://ai.meta.com/llama/) (from Meta AI) was released under a rather [permissive
license](https://ai.meta.com/llama/license/)
* the fine-tuned model from Together Computer uses the
[same license](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K-Instruct/blob/main/README.md)
* as a consequence, this repo does so as well