Update README.md
2. download the weights for the fine-tuned LLaMA-2 model from
   [Hugging Face](https://huggingface.co/togethercomputer/LLaMA-2-7B-32K) into a subfolder of `llama.cpp_in_Docker`
   (let's call the new folder `LLaMA-2-7B-32K`; a download sketch follows after this list)
3. within the <u>Docker Desktop</u>, search for and download a `basic-python` image - just use one of
   the most popular ones
4. from a <u>terminal session on your host computer</u> (i.e., not a Docker container!), start a new container
   for the downloaded image which mounts the folder we created before:<br> <br>`docker run --rm \
   -v ./llama.cpp_in_Docker:/llama.cpp \
   -t basic-python /bin/bash`<br> <br>(you may have to adjust the path to your local folder)
5. back in the <u>Docker Desktop</u>, open the "Terminal" tab of the started container and enter the
   following commands:<br> <br>
   ```
   apt update
   apt-get install software-properties-common -y
   apt-get update
   apt-get install g++ git make -y
   cd /llama.cpp
   git clone https://github.com/ggerganov/llama.cpp
   cd llama.cpp
   ```
6. now open the "Files" tab, navigate to the file `/llama.cpp/llama.cpp/Makefile`, right-click on it and
   choose "Edit file"
7. search for `aarch64` and - in the line found (which looks like `ifneq ($(filter aarch64%,$(UNAME_M)),)`) -
   change `ifneq` to `ifeq` (a `sed` alternative is sketched after this list)
8. save your change using the disk icon in the upper right corner of the editor pane and open the "Terminal"
   tab again
9. now enter the following commands:<br> <br>
   ```
   make
   python3 -m pip install -r requirements.txt
   python3 convert.py ../LLaMA-2-7B-32K
   ```
10. you are now ready to run the actual quantization, e.g., using<br> <br>
    ```
    ./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
      ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf Q4_0
    ```
11. run any quantizations you need (a loop sketch follows after this list) and stop the container again
    (you may even delete it, as the generated files will remain available on your host computer)
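
As a sketch for step 2 - assuming `git` and `git-lfs` are installed on your host, which the steps above
do not otherwise require - the weights can be pulled from Hugging Face like this:

```
# hypothetical alternative for step 2: clone the weights with git-lfs
cd llama.cpp_in_Docker
git lfs install
git clone https://huggingface.co/togethercomputer/LLaMA-2-7B-32K
```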
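
If you would rather stay in the container's terminal than use the file editor, the Makefile change from
step 7 can also be made with `sed` - a sketch, assuming GNU `sed` as found in Debian/Ubuntu-based images:

```
# flip the aarch64 guard from ifneq to ifeq (run inside /llama.cpp/llama.cpp)
sed -i 's/ifneq ($(filter aarch64%/ifeq ($(filter aarch64%/' Makefile
# check that the edit took effect
grep -n 'aarch64' Makefile
```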
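
For step 11, a minimal loop sketch that produces several quantized variants in one go - the quantization
types shown (`Q4_0`, `Q5_0`, `Q8_0`) are just examples, pick the ones you actually need:

```
# produce several quantized variants from the converted f16 GGUF file
for q in Q4_0 Q5_0 Q8_0; do
  ./quantize ../LLaMA-2-7B-32K/ggml-model-f16.gguf \
    ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-$q.gguf $q
done
```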

You are now free to move the quantization results to where you need them and run inferences with context
lengths up to 32K (depending on the amount of memory you have available - long contexts need an awful
lot of RAM)
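
As an illustration only (not part of the steps above): with llama.cpp's `main` binary, the context size
is set with `-c`, so a 32K-context inference could look like this - prompt and token count are
placeholders:

```
# sample inference call with a 32K context window (needs plenty of RAM)
./main -m ../LLaMA-2-7B-32K/LLaMA-2-7B-32K-Q4_0.gguf \
  -c 32768 -n 256 \
  -p "Your prompt goes here"
```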

## License ##