dreamerdeo committed
Commit b3af2ba · verified · 1 Parent(s): 1294477

Update README.md

Files changed (1):
  1. README.md +49 -21

README.md CHANGED
@@ -61,37 +61,65 @@ Through systematic experiments to determine the weights of different languages,
 The approach boosts their performance on SEA languages while maintaining proficiency in English and Chinese without significant compromise.
 Finally, we continually pre-train the Qwen1.5-0.5B model with 400 Billion tokens, and other models with 200 Billion tokens to obtain the Sailor models.
 
-## Requirements
-The code of Sailor has been in the latest Hugging face transformers and we advise you to install `transformers>=4.37.0`.
-
-## Quickstart
-
-Here provides a code snippet to show you how to load the tokenizer and model and how to generate contents.
-
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-device = "cuda" # the device to load the model
-
-model = AutoModelForCausalLM.from_pretrained("sail/Sailor-7B", device_map="auto")
-tokenizer = AutoTokenizer.from_pretrained("sail/Sailor-7B")
-
-input_message = "Model bahasa adalah model probabilistik"
-### The given Indonesian input translates to 'A language model is a probabilistic model of.'
-
-model_inputs = tokenizer([input_message], return_tensors="pt").to(device)
-
-generated_ids = model.generate(
-    model_inputs.input_ids,
-    max_new_tokens=64
-)
-
-generated_ids = [
-    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
-]
-
-response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
-print(response)
-```
+### How to run with `llama.cpp`
+
+```shell
+# install and build llama.cpp
+git clone https://github.com/ggerganov/llama.cpp.git
+cd llama.cpp
+make
+pip install -r requirements.txt
+
+# generate with llama.cpp
+# the prompt "Cara memanggang ikan?" is Indonesian for "How to grill fish?"
+./main -ngl 40 -m ggml-model-Q4_K_M.gguf -p "<|im_start|>question\nCara memanggang ikan?\n<|im_start|>answer\n" --temp 0.7 --repeat_penalty 1.1 -n 400 -e
+```
+
+> Change `-ngl 40` to the number of layers to offload to the GPU. Remove it if you don't have GPU acceleration.
 
78
 
79
+ ### How to run with `llama-cpp-python`
 
80
 
81
+ ```shell
82
+ pip install llama-cpp-python
83
+ ```
84
 
85
+ ```python
86
+ import llama_cpp
87
+ import llama_cpp.llama_tokenizer
88
+
89
+ # load model
90
+ llama = llama_cpp.Llama.from_pretrained(
91
+ repo_id="sail/Sailor-4B-Chat-gguf",
92
+ filename="ggml-model-Q4_K_M.gguf",
93
+ tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("sail/Sailor-4B-Chat"),
94
+ n_gpu_layers=40,
95
+ n_threads=8,
96
+ verbose=False,
97
  )
98
 
99
+ system_role= 'system'
100
+ user_role = 'question'
101
+ assistant_role = "answer"
102
+
103
+ system_prompt= \
104
+ 'You are an AI assistant named Sailor created by Sea AI Lab. \
105
+ Your answer should be friendly, unbiased, faithful, informative and detailed.'
106
+ system_prompt = f"<|im_start|>{system_role}\n{system_prompt}<|im_end|>"
107
+
108
+ # inference example
109
+ output = llama(
110
+ system_prompt + '\n' + f"<|im_start|>{user_role}\nCara memanggang ikan?\n<|im_start|>{assistant_role}\n",
111
+ max_tokens=256,
112
+ temperature=0.7,
113
+ top_p=0.75,
114
+ top_k=60,
115
+ stop=["<|im_end|>", "<|endoftext|>"]
116
+ )
117
 
118
+ print(output['choices'][0]['text'])
 
119
  ```
+
+### How to build demo
+
+Install `llama-cpp-python` and `gradio`, then run the [demo script](https://github.com/sail-sg/sailor-llm/blob/main/demo/llamacpp_demo.py).
 
 # License
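
Both of the added snippets build the same chat prompt by hand. For readers who want to check the template, the string construction can be factored into a small standalone helper; `build_prompt` below is a hypothetical name, not part of the Sailor repo or the llama-cpp-python API — just a minimal sketch of the format the commands in this diff assume.

```python
# Minimal sketch of the Sailor chat prompt format used in the examples above.
# `build_prompt` is a hypothetical helper, not a released API.

SYSTEM_ROLE = "system"
USER_ROLE = "question"
ASSISTANT_ROLE = "answer"

SYSTEM_PROMPT = (
    "You are an AI assistant named Sailor created by Sea AI Lab. "
    "Your answer should be friendly, unbiased, faithful, informative and detailed."
)

def build_prompt(question: str) -> str:
    """Assemble the plain-text prompt expected by the Sailor chat models."""
    system = f"<|im_start|>{SYSTEM_ROLE}\n{SYSTEM_PROMPT}<|im_end|>"
    # the generation stops at "<|im_end|>" / "<|endoftext|>", as in the snippet above
    return system + "\n" + f"<|im_start|>{USER_ROLE}\n{question}\n<|im_start|>{ASSISTANT_ROLE}\n"

# "Cara memanggang ikan?" is Indonesian for "How to grill fish?"
print(build_prompt("Cara memanggang ikan?"))
```

The resulting string can be passed directly as the `-p` argument to `./main` or as the first argument of the `llama(...)` call above.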