Commit d9176aa by indiejoseph (1 parent: 4db0223)
Update README.md
README.md CHANGED
@@ -4,9 +4,9 @@ language:
 - yue
 ---

-Continual pretraining model of Yi6b on Cantonese corpus
+Continual pretraining of the [Yi6b](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus consisting of translated Hong Kong news, Wikipedia articles, subtitles, and open-source dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.

-The goal of this model
+The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with minor misalignment between written Chinese and Cantonese, as well as with knowledge transfer across languages.

 ### Result

@@ -104,4 +104,7 @@ messages = [{"role": "user", "content": "邊個係香港特首?"}]

 print(chat(messages))

-```
+```
+
+### Limitation
+
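The README text added in this commit says the vocabulary was extended with common Cantonese words, but it does not say how. The following is a minimal sketch of what such a step generally involves, using a toy vocabulary and embedding table in plain Python. The function name `extend_vocab`, the sample tokens, and the mean-of-existing-rows initialisation are all assumptions made for illustration, not the author's actual procedure.

```python
from statistics import fmean


def extend_vocab(vocab, embeddings, new_tokens):
    """Add unseen tokens to `vocab` and append one embedding row per added token.

    New rows are initialised to the mean of the existing rows. That choice is
    an assumption of this sketch; the README does not describe the method used.
    """
    if len(vocab) != len(embeddings):
        raise ValueError("vocab and embeddings must have the same number of entries")
    # Column-wise mean of the embedding rows that exist before extension.
    mean_row = [fmean(col) for col in zip(*embeddings)]
    added = []
    for token in new_tokens:
        if token in vocab:
            continue  # already covered by the base vocabulary
        vocab[token] = len(vocab)          # next free token id
        embeddings.append(list(mean_row))  # fresh copy for each new row
        added.append(token)
    return added


# Toy base vocabulary with 2-dimensional embeddings (hypothetical values).
vocab = {"我": 0, "你": 1}
embeddings = [[1.0, 0.0], [0.0, 1.0]]

added = extend_vocab(vocab, embeddings, ["係", "你", "嘅"])
print(added)
print(len(vocab), len(embeddings))
```

With Hugging Face Transformers, the analogous calls would be `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`. Whether this model was built that way is not stated in the README.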