---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Sailor2-8B-Chat
---
Quantizations of https://huggingface.co/sail/Sailor2-8B-Chat

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [ollama](https://github.com/ollama/ollama)
* [jan](https://github.com/janhq/jan)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [GPT4All](https://github.com/nomic-ai/gpt4all)
---
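A typical way to run one of these GGUF quantizations locally is with llama.cpp's `llama-cli`. A minimal sketch, assuming a built llama.cpp checkout; the repo id placeholder and the `Q4_K_M` filename are illustrative, so substitute the actual repo and whichever `.gguf` file in it fits your hardware:

```shell
# Download one quantization from this repo (repo id and filename are placeholders)
huggingface-cli download <this-repo-id> Sailor2-8B-Chat-Q4_K_M.gguf --local-dir .

# Interactive chat; -cnv applies the chat template embedded in the GGUF metadata
./llama-cli -m Sailor2-8B-Chat-Q4_K_M.gguf -cnv -n 512
```

Lower-bit quantizations trade answer quality for memory; the imatrix variants in this repo are calibrated to reduce that loss at small sizes.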

# From original readme

Sailor2 is a community-driven initiative that brings cutting-edge multilingual language models to South-East Asia (SEA).
Our research highlights strong demand for models in the **8B and 20B parameter** range for production use, alongside **1B models** for specialized applications
such as speculative decoding and research purposes.
These models, released under the **Apache 2.0 license**, provide enhanced accessibility to advanced language technologies across the region.

Sailor2 builds upon the foundation of the excellent multilingual model [Qwen 2.5](https://huggingface.co/collections/Qwen/qwen25-66e81a666513e518adb90d9e) and
is continually pre-trained on **500B tokens** to better support **15 languages** with a unified model.
These languages are English, Chinese, Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray.
By addressing the growing demand for diverse, robust, and accessible language models, Sailor2 seeks to serve underserved SEA communities with open, inclusive, and accessible multilingual LLMs.
The Sailor2 model comes in three sizes (1B, 8B, and 20B), which are **expanded from the Qwen2.5 base models** of 0.5B, 7B, and 14B parameters, respectively.

## Model Summary
- **Model Collections:** [Base Model & Chat Model](https://huggingface.co/collections/sail/sailor2-language-models-674d7c9e6b4dbbd9a869906b)
- **Project Website:** [sea-sailor.github.io/blog/sailor2/](https://sea-sailor.github.io/blog/sailor2/)
- **Codebase:** [github.com/sail-sg/sailor2](https://github.com/sail-sg/sailor2)
- **Technical Report:** Coming Soon

## Training details

During development, we employ a range of advanced technologies to ensure top-tier performance and efficiency:

1. model expansion
2. optimized data mixing strategies
3. multi-stage pre-training protocols
4. advanced multilingual post-training

Please refer to the [Sailor2 Blog](https://sea-sailor.github.io/blog/sailor2/) for more training details.

## Requirements
Support for Sailor2 is included in recent releases of Hugging Face `transformers`; we advise you to install `transformers==4.46.3`.

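Assuming a standard Python environment, the pinned version can be installed with pip (`torch` is added here because the Quickstart snippet imports it; this command is not part of the original card):

```shell
pip install "transformers==4.46.3" torch
```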
## Quickstart

Here is a code snippet showing how to load the tokenizer and model, and how to generate content.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

# Load the chat model in bfloat16, sharded automatically across available devices
model = AutoModelForCausalLM.from_pretrained(
    'sail/Sailor2-8B-Chat',
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained('sail/Sailor2-8B-Chat')
system_prompt = \
'You are an AI assistant named Sailor2, created by Sea AI Lab. \
As an AI assistant, you can answer questions in English, Chinese, and Southeast Asian languages \
such as Burmese, Cebuano, Ilocano, Indonesian, Javanese, Khmer, Lao, Malay, Sundanese, Tagalog, Thai, Vietnamese, and Waray. \
Your responses should be friendly, unbiased, informative, detailed, and faithful.'

prompt = "Beri saya pengenalan singkat tentang model bahasa besar."  # Indonesian: "Give me a brief introduction to large language models."
# prompt = "Hãy cho tôi một giới thiệu ngắn gọn về mô hình ngôn ngữ lớn."  # Vietnamese
# prompt = "ให้ฉันแนะนำสั้น ๆ เกี่ยวกับโมเดลภาษาขนาดใหญ่"  # Thai

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": prompt}
]
# Render the conversation with the model's chat template, leaving an open assistant turn
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
)

# Strip the prompt tokens, keeping only the newly generated completion
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
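For reference, the Qwen2.5 family that Sailor2 is expanded from uses a ChatML-style chat template, so `apply_chat_template` produces a string along these lines. This is a sketch under that assumption, with no model download required; the authoritative template ships with the tokenizer, and `render_chatml` is a hypothetical helper, not part of any library:

```python
# Minimal sketch of a ChatML-style template (the convention used by the
# Qwen2.5 family). Illustrative only; the real template is defined by the
# tokenizer's chat_template field.
def render_chatml(messages, add_generation_prompt=True):
    parts = []
    for m in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model generates the reply from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

text = render_chatml([
    {"role": "system", "content": "You are an AI assistant named Sailor2."},
    {"role": "user", "content": "Hello!"},
])
print(text)
```

Seeing the rendered string makes it easier to debug prompt issues, e.g. a missing `add_generation_prompt=True`, which would leave no open assistant turn for the model to complete.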