Update README.md
README.md CHANGED
@@ -28,6 +28,7 @@ SambaLingo-Russian-Base is a pretrained Bi-lingual Russian and English model tha
 - **Language(s):** Russian, English
 - **Finetuned from model:** [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)
 - **Try the chat version of this model**: [SambaLingo-chat-space](https://huggingface.co/spaces/sambanovasystems/SambaLingo-chat-space).
+- **Paper:** [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
 - **Blog Post**: [sambalingo-open-source-language-experts](https://sambanova.ai/blog/sambalingo-open-source-language-experts)
 
 ## Getting Started
@@ -53,19 +54,7 @@ All pre-training is done on the [Cultura-X](https://huggingface.co/datasets/uonl
 We extended the vocabulary of the base llama model from 32,000 tokens to 57,000 tokens by adding up to 25,000 non-overlapping tokens from the new language.
 
 ## Evaluation Results
-
-|------------------------------------------|-----------------------------------|--------------------------------------|----------------------|--------------------|---------------------|--------|
-| Holdout Perplexity (Lower is better) | **1.444** | 1.556 | 1.611 | 1.797 | 1.504 | 1.806 |
-| FLORES en->ru (8 shot, CHRF) | **0.472** | 0.425 | 0.319 | 0.204 | 0.263 | 0.211 |
-| FLORES ru->en (8 shot, CHRF) | **0.587** | 0.527 | 0.317 | 0.258 | 0.429 | 0.251 |
-| FLORES en->ru (8 shot, BLEU) | **0.194** | 0.145 | 0.074 | 0.012 | 0.045 | 0.021 |
-| FLORES ru->en (8 shot, BLEU) | **0.301** | 0.249 | 0.062 | 0.032 | 0.152 | 0.039 |
-| Belebele (3 shot) | **39.00%** | 34.44% | 24.33% | 29.00% | 21.89% | 23.67% |
-| SIB-200 (3 shot) | 69.12% | **78.92%** | 32.84% | 46.08% | 63.73% | 42.65% |
-| XNLI (0 shot) | 35.29% | **49.78%** | 45.61% | 42.61% | 46.39% | 45.39% |
-| XStoryCloze (0 shot) | **71.67%** | 68.96% | 60.75% | 52.68% | 63.40% | 59.43% |
-| XWinograd (0 shot) | **69.21%** | 66.67% | 60.63% | 57.14% | 63.17% | 60.00% |
-
+For evaluation results see our paper: [SambaLingo: Teaching Large Language Models New Languages](https://arxiv.org/abs/2404.05829)
 
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
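The vocabulary-extension step quoted as context in this hunk (32,000 to 57,000 tokens) can be pictured with a short sketch. This is not the authors' actual pipeline, just a minimal illustration assuming the standard transformers tokenizer and embedding-resizing APIs; the token list is a placeholder, not the real set of up to 25,000 tokens mined from the Russian corpus:

```python
# Illustrative sketch of the vocabulary extension described above:
# add non-overlapping new-language tokens to the base Llama 2 tokenizer,
# then grow the model's embedding matrix to cover the new ids.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

new_tokens = ["пример", "словарь", "токен"]  # placeholder Russian tokens
num_added = tokenizer.add_tokens(new_tokens)  # skips tokens already in the vocab

# Resize input/output embeddings so the new token ids have rows to train.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```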
@@ -107,12 +96,12 @@ We would like to give a special thanks to the following groups:
 
 ## Cite SambaLingo
 ```
-@
-
-
-
-
-
-
+@misc{csaki2024sambalingo,
+      title={SambaLingo: Teaching Large Language Models New Languages},
+      author={Zoltan Csaki and Bo Li and Jonathan Li and Qiantong Xu and Pian Pawakapan and Leon Zhang and Yun Du and Hengyu Zhao and Changran Hu and Urmish Thakker},
+      year={2024},
+      eprint={2404.05829},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
 }
 ```
|