Update README.md
#1
by
YX-Cerebras
- opened
README.md
CHANGED
@@ -52,10 +52,10 @@ print(tokenizer.decode(output))
|
|
52 |
### Training process
|
53 |
We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.
|
54 |
|
55 |
-
|
56 |
-
|
57 |
-
|
58 |
-
|
59 |
|
60 |
### Training data
|
61 |
This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included into the training data mixture:
|
|
|
52 |
### Training process
|
53 |
We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.
|
54 |
|
55 |
+
- Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, to increase a variety of Japanese Kanjis tokens. This improves Japanese tokenization efficiency by 21%.
|
56 |
+
- We initialize newly added embeddings using similarity-based token embedding initialization. Added embedding vectors are initialized with a weighted average of embeddings of top K most similar tokens in the original LLaMA-3 vocabulary, using an external embedding.
|
57 |
+
- We start with embedding-only training on 8.6B tokens, freezing the weights of all layers expect for the embedding and unembedding layers.
|
58 |
+
- This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
|
59 |
|
60 |
### Training data
|
61 |
This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included into the training data mixture:
|