indiejoseph committed on
Commit
d9176aa
1 Parent(s): 4db0223

Update README.md

  1. README.md +6 -3
README.md CHANGED
@@ -4,9 +4,9 @@ language:
  - yue
  ---
 
- Continual pretraining model of Yi6b on Cantonese corpus that composed by Hong Kong news(translated to Cantonese), Wikipedia, Subtitles and some open sourced dialogue corpora, we have extended the vocabulary with a common Cantonese words.
+ Continual pretraining model of of the [Yi6b](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
 
- The goal of this model is evaluate could we train a LLM that fluent in Cantonese with limited resource(200m tokens), we found the outcome is surprisingly good, although there still have mirror misalignment of Written Chinese and Cantonese, and knowledge across different languages.
+ The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
 
  ### Result
 
@@ -104,4 +104,7 @@ messages = [{"role": "user", "content": "邊個係香港特首?"}]
 
  print(chat(messages))
 
- ```
+ ```
+
+ ### Limitation
+