tokyo-electron-device-ai committed
Commit eece17e
1 Parent(s): 42b973c
Update README.md
README.md CHANGED
@@ -60,10 +60,10 @@ We follow the approach described in [Bilingual Adaptation of Monolingual Foundat
 ### Training data

 This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included in the training data mixture:

-
-
-
-
+* **[legacy-datasets/mc4](https://huggingface.co/datasets/legacy-datasets/mc4)**
+* **[range3/cc100-ja](https://huggingface.co/datasets/range3/cc100-ja)**
+* **[if001/oscar_2023_filtered](https://huggingface.co/datasets/if001/oscar_2023_filtered)**
+* **[dumps.wikimedia.org](https://dumps.wikimedia.org/)**
 * Note that this released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.

 ### Hyper-parameters
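
All four Japanese sources added in this commit can be streamed with the `datasets` library (the Wikipedia dumps via a Hub mirror). The sketch below shows one way to interleave an approximate 20% English / 80% Japanese mixture matching the ratio stated in the README; at 173B total tokens that ratio works out to roughly 34.6B English and 138.4B Japanese tokens. The English corpus (`allenai/c4`), the `wikimedia/wikipedia` config used in place of the raw dumps, the shared `text` column, and the seed are illustrative assumptions not taken from the commit, and the llm-jp-corpus filtering step is not reproduced here.

```python
from datasets import load_dataset, interleave_datasets

# Japanese corpora listed in this commit, streamed so nothing is downloaded up front.
# Older script-based datasets (e.g. legacy-datasets/mc4) may additionally require
# trust_remote_code=True depending on your `datasets` version.
ja_sources = [
    load_dataset("legacy-datasets/mc4", "ja", split="train", streaming=True),
    load_dataset("range3/cc100-ja", split="train", streaming=True),
    load_dataset("if001/oscar_2023_filtered", split="train", streaming=True),
    # Assumed Hub mirror of the dumps.wikimedia.org data referenced in the README.
    load_dataset("wikimedia/wikipedia", "20231101.ja", split="train", streaming=True),
]
# Keep only the text column so the sources can be interleaved (assumes each exposes "text").
ja_sources = [ds.select_columns(["text"]) for ds in ja_sources]

# Hypothetical English corpus -- the commit does not name the English data.
en_source = load_dataset("allenai/c4", "en", split="train", streaming=True).select_columns(["text"])

# Merge the Japanese corpora, then sample examples at roughly 20% English / 80% Japanese.
# This is an example-level ratio; the README's 20/80 split is presumably measured in tokens.
ja_mixture = interleave_datasets(ja_sources, seed=42)
train_mixture = interleave_datasets([en_source, ja_mixture], probabilities=[0.2, 0.8], seed=42)

for example in train_mixture.take(3):
    print(example["text"][:200])
```

In an actual continued pre-training run the mixture would still need to be tokenized and packed before reaching the trainer; the interleave step above only illustrates the stated sampling ratio.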