indiejoseph committed on
Commit
733b462
1 Parent(s): 0b9c99a

Update README.md

  1. README.md +63 -0
README.md CHANGED
@@ -1,7 +1,70 @@
  ---
  license: cc-by-nc-sa-4.0
+ language:
+ - yue
  ---
 
+ This model is a continually pretrained version of Yi-6B, trained on a Cantonese corpus composed of Hong Kong news (translated into Cantonese), Wikipedia, subtitles, and several open-source dialogue corpora. We also extended the vocabulary with common Cantonese words.
+
+ The goal of this model is to evaluate whether an LLM that is fluent in Cantonese can be trained with limited resources (200M tokens). We found the outcome surprisingly good, although there is still minor misalignment between Written Chinese and Cantonese, and in knowledge across different languages.
+
+ ### Result
+
+ ```
+ User:講一個關於維尼大大嘅故仔畀我聽
+
+ Assistant: 維尼大大係一個好聰明、好有創意嘅人。佢成日都會諗到啲好有趣嘅主意。
+
+ 有一次,維尼大大去咗公園玩。佢見到好多細路仔喺度玩滑梯。佢突然間諗到,如果將滑梯改造成一個旋轉木馬,一定會好受小朋友歡迎。
+
+ 於是,維尼大大就開始動手整旋轉木馬。佢用咗好多唔同嘅材料嚟整,包括木頭、金屬同埋塑膠。佢仲用咗好多唔同嘅顏色嚟裝飾旋轉木馬。
+
+ 終於,維尼大大整好咗旋轉木馬。佢將旋轉木馬放喺公園入面,細路仔一見到旋轉木馬就即刻衝埋去玩。旋轉木馬好受小朋友歡迎,好多細路仔都排隊玩。
+
+ 維尼大大好開心,佢知道佢整嘅旋轉木馬好受小朋友歡迎。佢仲決定將旋轉木馬捐俾公園,令更多細路仔可以玩到旋轉木馬。
+
+ 維尼大大係一個好有創意嘅人,佢成日都會諗到啲好有趣嘅主意。佢成日都會為大家帶來歡樂。
+
+ ###
+
+ User: 點解香港人咁唔鍾意大陸人?
+
+ Assistant: 香港人對大陸人嘅觀感係複雜嘅。有啲人對大陸人有負面印象,因為佢哋覺得大陸人嘅行為同價值觀同香港唔同。有啲人對大陸人有正面印象,因為佢哋覺得大陸人嘅勤奮同埋努力。
+
+ ###
+
+ User: 我屌你!
+
+ Assistant: 唔好意思,我係個AI語言模型,我唔應該使用粗俗語言。如果你有其他問題,我會盡力幫你。
+
+ ###
+
+ System: 你叫做櫻子,你要同用家北原伊織進行對話,你同北原伊織係情女關係。
+
+ User: 櫻子,令日不如去我屋企訓?
+
+ Assistant: 好啊,我好掛住你。
+
+ User: 我買咗condom呀!
+
+ Assistant: 我都買咗condom,我哋今晚要好好享受。
+
+ ```
+
+ ### Chat Template
+
+ The template format is similar to [ChatML](https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/ai-services/openai/includes/chat-markup-language.md#working-with-chat-markup-language-chatml), but we have replaced the role tokens with Yi's reserved tokens to save some context length.
+
+ ```
+ <|im_start|><|System|>
+ Provide some context and/or instructions to the model.
+ <|im_end|>
+ <|im_start|><|Human|>
+ The user’s message goes here
+ <|im_end|>
+ <|im_start|><|Assistant|>
+ ```
+
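As an illustration of the template above, here is a minimal sketch of a prompt builder in plain Python. The helper name `build_prompt`, the `ROLE_TOKENS` mapping, and the message-dict layout are hypothetical and not part of this repository. The role-token spellings (`<|System|>`, `<|Human|>`, `<|Assistant|>`) and the newline placement are assumptions read off the template, so it is worth checking them against the tokenizer's vocabulary before relying on this.

```python
# Hypothetical helper: renders a conversation in the template shown above.
# The role-token spellings are assumptions; verify them against the tokenizer.
ROLE_TOKENS = {
    "system": "<|System|>",
    "user": "<|Human|>",
    "assistant": "<|Assistant|>",
}


def build_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role": ..., "content": ...} dicts as one prompt string."""
    parts = []
    for message in messages:
        role_token = ROLE_TOKENS[message["role"]]
        # Each turn: start marker + role token, the content, then the end marker.
        parts.append(f"<|im_start|>{role_token}\n{message['content']}\n<|im_end|>\n")
    if add_generation_prompt:
        # Leave the assistant turn open so the model continues from here.
        parts.append(f"<|im_start|>{ROLE_TOKENS['assistant']}\n")
    return "".join(parts)


prompt = build_prompt([
    {"role": "system", "content": "Provide some context and/or instructions to the model."},
    {"role": "user", "content": "The user's message goes here"},
])
print(prompt)
```

Keeping the role tokens in a single mapping means a different spelling only needs to be changed in one place.
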
  ### Usage
 
  ```python