TheBloke committed on
Commit 58459f9
1 Parent(s): d26045d

Initial GGML model commit

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -35,7 +35,7 @@ tags:
 
 This repo contains GGML format model files for [Mikael110's Llama2 70b Guanaco QLoRA](https://huggingface.co/Mikael110/llama-2-70b-guanaco-qlora).
 
-These 70B Llama 2 GGML files currently only support CPU inference. They are known to work with:
+CUDA GPU acceleration is now available for Llama 2 70B GGML files. Metal acceleration (macOS) is not yet available. I haven't tested AMD acceleration - let me know if it works. The following clients/libraries are known to work with these files, including with CUDA GPU acceleration:
 * [llama.cpp](https://github.com/ggerganov/llama.cpp), commit `e76d630` and later.
 * [text-generation-webui](https://github.com/oobabooga/text-generation-webui), the most popular web UI.
 * [KoboldCpp](https://github.com/LostRuins/koboldcpp), version 1.37 and later. A powerful GGML web UI, especially good for storytelling.
@@ -62,8 +62,6 @@ These 70B Llama 2 GGML files currently only support CPU inference. They are kno
 
 Or one of the other tools and libraries listed above.
 
-There is currently no GPU acceleration; only CPU can be used.
-
 To use in llama.cpp, you must add the `-gqa 8` argument.
 
 For other UIs and libraries, please check the docs.
@@ -107,10 +105,12 @@ Refer to the Provided Files table below to see what files use which methods, and
 I use the following command line; adjust for your tastes and needs:
 
 ```
-./main -t 10 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
+./main -t 10 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "### Human: Write a story about llamas\n### Assistant:"
 ```
 Change `-t 10` to the number of physical CPU cores you have. For example, if your system has 8 cores/16 threads, use `-t 8`.
 
+Change `-ngl 40` to the number of GPU layers you have VRAM for. Use `-ngl 100` to offload all layers to VRAM if you have a 48GB card, or 2 x 24GB, or similar. Otherwise you can partially offload as many layers as you have VRAM for, on one or more GPUs.
+
 Remember the `-gqa 8` argument, required for Llama 70B models.
 
 If you want to have a chat-style conversation, replace the `-p <PROMPT>` argument with `-i -ins`.
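For a chat-style session, the new text says to replace the `-p <PROMPT>` argument with `-i -ins`. As a minimal sketch, the full command would then look like this (same model file and sampling flags as the example above; adjust `-t` and `-ngl` for your hardware):

```
./main -t 10 -ngl 40 -gqa 8 -m llama-2-70b-guanaco-qlora.ggmlv3.q4_K_M.bin --color -c 4096 --temp 0.7 --repeat_penalty 1.1 -n -1 -i -ins
```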
 
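Note that `-ngl` only takes effect in a GPU-enabled build of llama.cpp. A minimal sketch of the cuBLAS build path llama.cpp documented around this time (assuming a working CUDA toolkit; a CPU-only build still runs these files, it just won't offload layers):

```
# Clone llama.cpp and build with cuBLAS (CUDA) support
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make LLAMA_CUBLAS=1
```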