littlebird13
committed
Update README.md
README.md CHANGED
@@ -43,7 +43,7 @@ To run Qwen2, you can use `llama-cli` (the previous `main`) or `llama-server` (the previous `server`).
 We recommend using the `llama-server` as it is simple and compatible with OpenAI API. For example:
 
 ```bash
-./llama-server -m qwen2-
+./llama-server -m qwen2-0_5b-instruct-q5_k_m.gguf -ngl 24 -fa
 ```
 
 (Note: `-ngl 24` refers to offloading 24 layers to GPUs, and `-fa` refers to the use of flash attention.)
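
With the corrected command above, `llama-server` exposes an OpenAI-compatible HTTP endpoint. As a minimal sketch of querying it, assuming the server is running locally on llama.cpp's default port 8080 (adjust if you started it with `--host`/`--port`):

```bash
# Minimal request to llama-server's OpenAI-compatible chat endpoint.
# Assumes the default bind address and port (http://localhost:8080).
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Hello!"}
        ]
    }'
```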
@@ -71,7 +71,7 @@ print(completion.choices[0].message.content)
 If you choose to use `llama-cli`, pay attention to the removal of `-cml` for the ChatML template. Instead you should use `--in-prefix` and `--in-suffix` to tackle this problem.
 
 ```bash
-./llama-cli -m qwen2-
+./llama-cli -m qwen2-0_5b-instruct-q5_k_m.gguf \
     -n 512 -co -i -if -f prompts/chat-with-qwen.txt \
     --in-prefix "<|im_start|>user\n" \
     --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
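
For reference, the `--in-prefix`/`--in-suffix` pair in the context lines hand-rolls the ChatML turn structure that `-cml` used to apply automatically: each interactive input ends up framed for the model roughly as

```
<|im_start|>user
{your input}<|im_end|>
<|im_start|>assistant
```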