hon9kon9ize/CantoneseLLMChat-preview20240326

indiejoseph committed on Mar 29

Commit d9176aa • 1 Parent(s): 4db0223

Update README.md

Files changed (1)

README.md +6 -3

@@ -4,9 +4,9 @@ language:
 - yue
 ---
-Continual pretraining model of Yi6b on Cantonese corpus that composed by Hong Kong news(translated to Cantonese), Wikipedia, Subtitles and some open sourced dialogue corpora, we have extended the vocabulary with a common Cantonese words.
-The goal of this model is evaluate could we train a LLM that fluent in Cantonese with limited resource(200m tokens), we found the outcome is surprisingly good, although there still have mirror misalignment of Written Chinese and Cantonese, and knowledge across different languages.
+Continual pretraining model of of the [Yi6b](https://huggingface.co/01-ai/Yi-6B) model on a Cantonese corpus, which consisted of translated Hong Kong news, Wikipedia articles, subtitles, and open-sourced dialogue corpora. Additionally, we extended the vocabulary to include common Cantonese words.
+The goal of this model was to evaluate whether we could train a language model that is fluent in Cantonese with limited resources (200 million tokens). Surprisingly, the outcome was quite good. However, there are still some issues with mirror misalignment between written Chinese and Cantonese, as well as knowledge transfer across different languages.
 ### Result
@@ -104,4 +104,7 @@ messages = [{"role": "user", "content": "邊個係香港特首？"}]
 print(chat(messages))
-```
+```
+### Limitation