indiejoseph committed on
Commit
5b980d8
1 Parent(s): 52f94c1

Update README.md

  1. README.md +13 -3
README.md CHANGED
@@ -4,9 +4,15 @@ language:
4
  - yue
5
  ---
6
 
7
- **This is a preview version, we still finding the balance between overfitting and generalzation of DPO training, please check Limitation section about the problem we encountered in this version**
 
 
 
 
 
 
8
 
9
- Continual pretraining model of of the [Yi6b](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
10
 
11
  The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
12
 
@@ -126,6 +132,10 @@ print(chat(messages))
126
 
127
  The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. It was trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, so it may not be suitable for informal or dialectal Cantonese. Its output may contain bias and misinformation; please use it with caution.
128
 
 
 
 
 
129
  #### Misalignment of Written Chinese
130
 
131
  Cantonese is a spoken language without a standard written system, which makes it a low-resource language. In some generations, the model mixes Written Chinese and Cantonese, because the corpus used in the pretraining (PT) stage contains many mixed sentences. We believe this issue can be mitigated in the DPO stage, which is what we are currently working on.
@@ -133,7 +143,7 @@ Cantonese is a spoken language and without a standard written system that makes
133
  ```
134
  User: 咩嘢係狼人殺?
135
 
136
- Assistant:
137
  ```
138
 
139
  #### Misalignment of knowledge between different languages
 
4
  - yue
5
  ---
6
 
7
+ ---
8
+ license: cc-by-nc-sa-4.0
9
+ language:
10
+ - yue
11
+ ---
12
+
13
+ **This is a preview version, and this repository will be deleted once the new version is released. We are currently in the process of finding the balance between overfitting and generalization in DPO training. For more details about the problems we encountered in this version, please refer to the Limitation section.**
14
 
15
+ Continual pretraining model of the [Yi-6B](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
16
 
17
  The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
18
 
 
132
 
133
  The model is intended for Cantonese language understanding and generation tasks and may not be suitable for other Chinese languages. It was trained on a diverse range of Cantonese text, including news, Wikipedia, and textbooks, so it may not be suitable for informal or dialectal Cantonese. Its output may contain bias and misinformation; please use it with caution.
134
 
135
+ ### Hallucination
136
+
137
+ Like most LLMs, this model is prone to hallucination: it may generate incorrect or misleading information, so please use it with caution.
138
+
139
  #### Misalignment of Written Chinese
140
 
141
  Cantonese is a spoken language without a standard written system, which makes it a low-resource language. In some generations, the model mixes Written Chinese and Cantonese, because the corpus used in the pretraining (PT) stage contains many mixed sentences. We believe this issue can be mitigated in the DPO stage, which is what we are currently working on.
 
143
  ```
144
  User: 咩嘢係狼人殺?
145
 
146
+ Assistant: 狼人殺係一種多人回合策略遊戲,通常有 12 至 18 名玩家參與。玩家分為兩組:狼人同村民。狼人係狼人,佢哋嘅目標係要殺死村民,而村民嘅目標係要**找**出狼人並將佢哋處決。狼人同村民都有特殊能力,例如狼人可以隱藏自己嘅身份,而村民可以通過投票來處決狼人。狼人殺遊戲通常需要 30 至 45 分鐘才能完成。
147
  ```
148
 
149
  #### Misalignment of knowledge between different languages