indiejoseph
committed on
Commit 6a66a97 • 1 parent: 5c00a83
Update README.md
README.md
CHANGED
@@ -4,15 +4,12 @@ language:
 - yue
 ---
 
-
+Continual pretraining model of the [Yi-6B](https://huggingface.co/01-ai/Yi-1.5-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
 
-
-
-The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
+The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (400 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
 
 Here is a space you can interact with [CantoneseLLMChat](https://huggingface.co/spaces/hon9kon9ize/CantoneseLLMChat)
 
-[Technical Report](https://hon9kon9ize.com/posts/2024-04-28-cantonesellm_tech_report)
 
 ### Result
 