YX-Cerebras committed
Commit cb84391
1 Parent(s): e16f8cb

Update README.md

Minor format update

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -52,10 +52,10 @@ print(tokenizer.decode(output))
 ### Training process
 We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.
 
- - Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, adding a wider variety of Japanese Kanji tokens. This improves Japanese tokenization efficiency by 21%.
- - We initialize the newly added embeddings using similarity-based token embedding initialization: each added embedding vector is initialized with a weighted average of the embeddings of the top-K most similar tokens in the original LLaMA-3 vocabulary, using an external embedding.
- - We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except the embedding and unembedding layers.
- - This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
+ - Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, adding a wider variety of Japanese Kanji tokens. This improves Japanese tokenization efficiency by 21%.
+ - We initialize the newly added embeddings using similarity-based token embedding initialization: each added embedding vector is initialized with a weighted average of the embeddings of the top-K most similar tokens in the original LLaMA-3 vocabulary, using an external embedding.
+ - We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except the embedding and unembedding layers.
+ - This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
 
 ### Training data
 This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included in the training data mixture:
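To make the vocabulary-extension and initialization steps in the diff above concrete, here is a minimal sketch of similarity-based token embedding initialization. It assumes Hugging Face `transformers` plus `sentence-transformers` as the external embedding, new token ids appended after the original vocabulary, K = 5, and softmax-weighted averaging; the tokenizer path, the choice of external embedding model, and the weighting scheme are illustrative assumptions, not the exact recipe from the model card or the paper.

```python
# Sketch: similarity-based initialization of newly added token embeddings.
# Assumptions (not from the model card): `sentence-transformers` supplies the
# external embedding, new token ids are appended after the original vocabulary,
# K = 5, and weights come from a softmax over similarities.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

BASE = "meta-llama/Meta-Llama-3-8B"
base_tok = AutoTokenizer.from_pretrained(BASE)
ext_tok = AutoTokenizer.from_pretrained("path/to/extended-tokenizer")   # hypothetical path
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

old_size = len(base_tok)                       # original vocabulary size
model.resize_token_embeddings(len(ext_tok))    # rows >= old_size are the new tokens

# External embedding used only to score similarity between token surface forms.
scorer = SentenceTransformer("intfloat/multilingual-e5-large")
old_strings = base_tok.convert_ids_to_tokens(list(range(old_size)))
new_ids = list(range(old_size, len(ext_tok)))
new_strings = ext_tok.convert_ids_to_tokens(new_ids)
old_vecs = torch.tensor(scorer.encode(old_strings, normalize_embeddings=True))
new_vecs = torch.tensor(scorer.encode(new_strings, normalize_embeddings=True))

K = 5
emb_in = model.get_input_embeddings().weight.data     # embedding matrix
emb_out = model.get_output_embeddings().weight.data   # unembedding (lm_head) matrix

with torch.no_grad():
    for vec, new_id in zip(new_vecs, new_ids):
        sims = old_vecs @ vec                          # cosine similarity (unit vectors)
        topk = sims.topk(K)
        w = torch.softmax(topk.values, dim=0).to(emb_in.dtype)
        # Weighted average of the top-K most similar original tokens' embeddings.
        emb_in[new_id] = (w[:, None] * emb_in[topk.indices]).sum(dim=0)
        emb_out[new_id] = (w[:, None] * emb_out[topk.indices]).sum(dim=0)

model.save_pretrained("llama3-8b-vocab-extended-init")  # hypothetical output dir
```

The embedding-only stage and the subsequent full continued pre-training then differ only in which parameters are trainable. A minimal sketch of that freezing logic, using standard `transformers` accessors (checkpoint path is a placeholder; optimizer, data pipeline, and distributed setup are omitted):

```python
# Sketch: embedding-only training stage.
# Freeze everything except the embedding and unembedding (lm_head) layers,
# train on the first ~8.6B tokens, then unfreeze all weights for the full
# continued pre-training stage (164B tokens). Checkpoint path is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "llama3-8b-vocab-extended-init", torch_dtype=torch.bfloat16
)

for p in model.parameters():
    p.requires_grad = False                              # freeze all layers ...
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True                               # ... except the embedding
for p in model.get_output_embeddings().parameters():
    p.requires_grad = True                               # ... and the unembedding

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable/1e9:.2f}B of {total/1e9:.2f}B parameters")

# ... run the embedding-only stage here, then:
for p in model.parameters():
    p.requires_grad = True                               # full continued pre-training
```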
 
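The 21% tokenization-efficiency figure quoted in the diff can be sanity-checked by comparing how many tokens the base and extended tokenizers need for the same Japanese text (fewer tokens per character means higher efficiency). A toy check on one sentence; the real measurement would average over a large Japanese corpus, and the extended-tokenizer path is a placeholder:

```python
# Toy tokenization-efficiency comparison between the base and extended tokenizers.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
ext_tok = AutoTokenizer.from_pretrained("path/to/extended-tokenizer")  # hypothetical path

text = "吾輩は猫である。名前はまだ無い。どこで生れたかとんと見当がつかぬ。"
n_base = len(base_tok.encode(text, add_special_tokens=False))
n_ext = len(ext_tok.encode(text, add_special_tokens=False))
print(f"base: {n_base} tokens, extended: {n_ext} tokens "
      f"({100 * (n_base - n_ext) / n_base:.1f}% fewer)")
```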
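These sketches only restate the recipe described in the diff; the actual training was run on the Cerebras stack following arXiv:2407.12869, and none of the paths, model choices, or hyperparameters above should be read as the authors' implementation.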