(1) We construct a multilingual pre-trained model named MiLMo that performs better on minority language tasks.

(2) We train word2vec representations for five minority languages: Mongolian, Tibetan, Uyghur, Kazakh and Korean. By comparing the word2vec representations with the pre-trained model on the downstream task of text classification, we provide the best scheme for downstream tasks in minority languages. The experimental results show that the MiLMo model outperforms the word2vec representations (a minimal sketch of this word2vec baseline is given below).

(3) To address the scarcity of minority language datasets, we construct a text classification dataset named MiTC covering five languages (Mongolian, Tibetan, Uyghur, Kazakh and Korean), and publish the word2vec representations, the multilingual pre-trained model MiXLM and the multilingual classification dataset MiTC.
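As a rough illustration of the word2vec baseline from (2), the sketch below averages the released word vectors into a document vector that can be fed to any standard classifier. It is a minimal example, not the paper's exact setup: it assumes the vectors are distributed in the standard word2vec text format, and the file name is hypothetical.

```
# Minimal word2vec baseline sketch (assumptions: standard word2vec text format,
# hypothetical file name "mongolian.vec" -- substitute the actual released file).
import numpy as np
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("mongolian.vec", binary=False)

def document_vector(tokens):
    """Average the vectors of in-vocabulary tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors.key_to_index]
    if not known:
        return np.zeros(vectors.vector_size, dtype=np.float32)
    return np.mean(known, axis=0)

# Feed document_vector(...) features to a simple classifier (e.g. logistic
# regression) to reproduce a word2vec-style baseline for comparison with MiLMo.
```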
## Experimental Results

We obtain training data for five minority languages: Mongolian, Tibetan, Uyghur, Kazakh and Korean. The word segmentation results for the five languages are shown in Table 1.

<p align="center">Table 1: The results of word segmentation in each minority language.</p>
<p align="center"> <img src="https://github.com/user-attachments/assets/635a3fbb-e7e5-49dc-953d-437f5b9d940f" width="800" /></p>

We use MiLMo for the downstream text classification experiments on MiTC; a rough fine-tuning sketch is shown below.
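The sketch is not the exact training configuration from the paper: the repo id `CMLI-NLP/MiLMo` is taken from the Download section below, loading through the generic `transformers` Auto classes is assumed, and the sample texts, labels and label count are placeholders to be replaced with the real MiTC splits.

```
# Rough fine-tuning sketch, not the paper's exact setup. Assumptions: the
# "CMLI-NLP/MiLMo" checkpoint loads via the generic Auto classes, and the
# texts, labels and num_labels below are placeholders for real MiTC data.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CMLI-NLP/MiLMo")
model = AutoModelForSequenceClassification.from_pretrained("CMLI-NLP/MiLMo", num_labels=10)

texts = ["example sentence one", "example sentence two"]  # placeholder MiTC samples
labels = torch.tensor([0, 1])                             # placeholder label ids

batch = tokenizer(texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# One training step: the model computes cross-entropy loss internally when labels are passed.
model.train()
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()

# Inference: predicted class ids for the same batch.
model.eval()
with torch.no_grad():
    predictions = model(**batch).logits.argmax(dim=-1)
print(predictions)
```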
<p align="center">The performances of MiLMo on text classification</p>
<p align="center"> <img src="https://github.com/user-attachments/assets/0f6c4a64-6390-4ab9-bdd7-b36b7fbe0162" width="800" /></p>
|
23 |
+
|
24 |
+
## Download
[Paper](https://ieeexplore.ieee.org/document/10393961)
[MiLMo](https://huggingface.co/CMLI-NLP/MiLMo)
[Word2vec](https://huggingface.co/CMLI-NLP/MiLMo)
[Data Set](https://huggingface.co/CMLI-NLP/MiLMo)
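The released files can also be fetched programmatically with `huggingface_hub`; the repo id below comes from the links above, and the exact files inside the repository are not listed here.

```
# Optional programmatic download of the release (repo id taken from the links above).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="CMLI-NLP/MiLMo")
print("Downloaded to:", local_dir)
```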
## Citation
Plain Text:
J. Deng, H. Shi, X. Yu, W. Bao, Y. Sun and X. Zhao, "MiLMo:Minority Multilingual Pre-Trained Language Model," 2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Honolulu, Oahu, HI, USA, 2023, pp. 329-334, doi: 10.1109/SMC53992.2023.10393961.

BibTeX:
```
@INPROCEEDINGS{10393961,
  author={Deng, Junjie and Shi, Hanru and Yu, Xinhe and Bao, Wugedele and Sun, Yuan and Zhao, Xiaobing},
  booktitle={2023 IEEE International Conference on Systems, Man, and Cybernetics (SMC)},
  title={MiLMo:Minority Multilingual Pre-Trained Language Model},
  year={2023},
  volume={},
  number={},
  pages={329-334},
  keywords={Soft sensors;Text categorization;Social sciences;Government;Data acquisition;Morphology;Data models;Multilingual;Pre-trained language model;Datasets;Word2vec},
  doi={10.1109/SMC53992.2023.10393961}}
```