Update README.md
2. download the weights for the fine-tuned LLaMA-2 model from
   [Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) into a subfolder of `llama.cpp_in_Docker`
   (let's call the new folder `LLaMA-2-7B-32K`; a download sketch follows after this list)
3. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
   the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
   for the downloaded image which mounts the folder we created before:<br> <br>`docker run --rm \
   -v ./llama.cpp_in_Docker:/llama.cpp \
   -t basic-python /bin/bash`<br> <br>(you may have to adjust the path to your local folder)
5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
   following commands:<br> <br>
   ```
   apt update
   apt-get install software-properties-common -y
   apt-get update
   apt-get install g++ git make -y
   cd /llama.cpp
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   ```
6. now open the "Files" tab, navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
   choose "Edit file"
7. search for `aarch64` and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) -
   change `ifneq` to `ifeq` (a `sed` alternative is sketched after this list)
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
   tab again
9. now enter the following commands:<br> <br>
   ```
   make
   python3 -m pip install -r requirements.txt
   python3 convert.py ../LLaMA-2-7B-32K
   ```
10. you are now ready to run the actual quantization, e.g., using<br> <br>
    ```
    ./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
      ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf Q4_0
    ```
11. run any quantizations you need (a loop sketch follows after this list) and stop the container again
    (you may even delete it, as the generated files will remain available on your host computer)
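
As a sketch for step 2 - assuming `git` and `git-lfs` are installed on your host, which the steps above
do not otherwise require - the weights can be pulled from Hugging Face like this:

```
# hypothetical alternative for step 2: clone the weights with git-lfs
cd llama.cpp_in_Docker
git lfs install
git clone https://huggingface.co/togethercomputer/LLaMA-2-7B-32K
```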
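
If you would rather stay in the container's terminal than use the file editor, the Makefile change from
step 7 can also be made with `sed` - a sketch, assuming GNU `sed` as found in Debian/Ubuntu-based images:

```
# flip the aarch64 guard from ifneq to ifeq (run inside /llama.cpp/llama.cpp)
sed -i 's/ifneq ($(filter aarch64%/ifeq ($(filter aarch64%/' Makefile
# check that the edit took effect
grep -n 'aarch64' Makefile
```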
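
For step 11, a minimal loop sketch that produces several quantized variants in one go - the quantization
types shown (`Q4_0`, `Q5_0`, `Q8_0`) are just examples, pick the ones you actually need:

```
# produce several quantized variants from the converted f16 GGUF file
for q in Q4_0 Q5_0 Q8_0; do
  ./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
    ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-$q.gguf $q
done
```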

You are now free to move the quantization results to where you need them and run inferences with context
lengths up to 32K (depending on the amount of memory you have available - long contexts need an awful
lot of RAM)
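
As an illustration only (not part of the steps above): with llama.cpp's `main` binary, the context size
is set with `-c`, so a 32K-context inference could look like this - prompt and token count are
placeholders:

```
# sample inference call with a 32K context window (needs plenty of RAM)
./main -m ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf \
  -c 32768 -n 256 \
  -p "Your prompt goes here"
```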

## License ##