tokyo-electron-device-ai committed on
Commit
a2534a0
1 Parent(s): 9b29313

Update README.md

Files changed (1)
  1. README.md +68 -13
README.md CHANGED
@@ -7,17 +7,12 @@ base_model:
  - meta-llama/Meta-Llama-3-8B
  ---

- ## Model Details
- Llama 3 tedllm is a large language model (8B) built by continual pre-training on the Meta Llama 3 8B model. Llama 3 tedllm was developed to enhance Japanese language capabilities and to handle domain-specific data.
- We use approximately 173 billion tokens from a large Japanese corpus. This model was trained on Cerebras CS-3 wafer-scale systems. Cerebras' weight streaming technology simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.
- ## Intended uses & limitations

- You can use the raw model for text generation or fine-tune it on a downstream task.

- ## How to use
-
- You can use this model directly with a pipeline for text generation.
- Here is how to use this model to get the features of a given text in PyTorch:

  ```python
  import torch
@@ -37,14 +32,74 @@ with torch.no_grad():
  print(tokenizer.decode(output))
  ```

- ## Limitations and bias

- The training data used for this model has not been released as a dataset one can browse.

  ## Training data

- The published model is not trained with the domain-specific data. It is trained with a Japanese corpus only, because the domain-specific data is proprietary to us. We do not plan to release models trained with the domain-specific data.

- ## Model Card Contact

  If you have any questions, please feel free to contact [email protected].

  - meta-llama/Meta-Llama-3-8B
  ---

+ # Llama3-tedllm-8B-v0
+ Llama3-tedllm-8b-v0 is a bilingual Japanese-English generative model built through continuous pre-training from Llama-3-8B on approximately 173 billion tokens. It is designed to enhance Japanese language understanding and generation while preserving the English proficiency of Llama-3.

+ # How to use

+ Below is a sample code snippet for using this model for text generation.

  ```python
  import torch
  print(tokenizer.decode(output))
  ```
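Only the first and last lines of the snippet are visible in this diff view. As a complement, here is a minimal self-contained sketch of the full generation flow, assuming the standard Hugging Face `transformers` API; the repository id, prompt, and sampling settings below are assumptions rather than values taken from this card.

```python
# Minimal text-generation sketch (assumptions noted in comments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with the actual one if it differs.
model_id = "tokyo-electron-device-ai/llama3-tedllm-8b-v0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; fp16/fp32 also work
    device_map="auto",            # requires the `accelerate` package
)

prompt = "人工知能とは、"  # example Japanese prompt (assumption): "Artificial intelligence is ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```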
 
+ # Model Details
+ * **Developed by**: TED, Cerebras Systems
+ * **Language(s)**: Japanese and English
+ * **Model architecture**: Matches LLaMA-3 8B
+ * **License**: Meta Llama 3 License
+ * **Trained from model**: LLaMA-3 8B
+ * **Vocabulary size**: 141,056 tokens (see the sketch below)
+ * **Context length**: 8192 tokens
+ * **Input**: text data
+ * **Output**: generated text
+
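The vocabulary size and context length listed above can be cross-checked against the released configuration and tokenizer files. A minimal sketch, assuming the standard `transformers` API and the assumed repository id from the usage example above:

```python
# Sketch: verify vocabulary size and context length from the released files.
from transformers import AutoConfig, AutoTokenizer

model_id = "tokyo-electron-device-ai/llama3-tedllm-8b-v0"  # assumed repository id

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("embedding rows (config.vocab_size):", config.vocab_size)                      # expected 141056
print("tokenizer vocabulary size:", len(tokenizer))                                  # expected ~141056
print("context length (max_position_embeddings):", config.max_position_embeddings)  # expected 8192
```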
+ # Intended Use & Limitations
+
+ * **Intended Use**: This model is continuously pretrained using Llama-3-8B as the foundation. The model has not been exposed to instruction tuning data. It is designed for text generation tasks and can also be fine-tuned for specific downstream applications, making it suitable for a variety of users, including researchers, developers, and businesses.
+ * **Limitations**: Despite its versatility, the model is trained on web-crawled datasets like mc4 and OSCAR, which may contain inaccuracies, biases, or harmful content. As a result, it can generate incorrect, misleading, or offensive outputs. Users should critically evaluate the results, especially in high-stakes or sensitive applications, and are responsible for ensuring compliance with legal and ethical standards. This model is a tool, not a source of truth, and its outputs should be verified in context.

+ # Training Details
+ ## Training process
+ We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.
+
+ - Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, to add a wider variety of Japanese kanji tokens. This improves Japanese tokenization efficiency by 21%.
+ - We initialize newly added embeddings using similarity-based token embedding initialization: each added embedding vector is initialized with a weighted average of the embeddings of the top-K most similar tokens in the original LLaMA-3 vocabulary, where similarity is computed with an external embedding model (see the sketch after this list).
+ - We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except for the embedding and unembedding layers.
+ - This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
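To make the similarity-based initialization and the embedding-only freezing step concrete, here is an illustrative sketch. The encoder choice, the value of K, the placeholder new tokens, and the softmax weighting are assumptions, not details taken from this card.

```python
# Illustrative sketch of similarity-based initialization for newly added token
# embeddings, followed by the embedding-only freezing step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

base_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# 1) Extend the vocabulary with new tokens (placeholder examples).
new_tokens = ["新しい語彙", "追加トークン"]
old_vocab_size = model.get_input_embeddings().weight.shape[0]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2) Initialize each new embedding as a weighted average of the embeddings of the
#    top-K most similar original tokens, with similarity taken from an external
#    embedding model (here an assumed sentence-transformers encoder).
K = 5
encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed choice
orig_tokens = [tokenizer.convert_ids_to_tokens(i) for i in range(old_vocab_size)]
orig_ext = torch.tensor(encoder.encode(orig_tokens, normalize_embeddings=True))  # slow; kept simple for clarity

emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for tok in new_tokens:
        new_ext = torch.tensor(encoder.encode([tok], normalize_embeddings=True))[0]
        sims = orig_ext @ new_ext                    # cosine similarity (vectors are normalized)
        topk = torch.topk(sims, K)
        weights = torch.softmax(topk.values, dim=0)  # assumed weighting scheme
        emb[tokenizer.convert_tokens_to_ids(tok)] = (weights[:, None] * emb[topk.indices]).sum(dim=0)
        # A full version would also initialize the matching rows of the unembedding (lm_head).

# 3) Embedding-only training: freeze everything except the embedding and
#    unembedding layers, then run the usual training loop on the 8.6B-token subset.
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)
```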
 
  ## Training data
+ This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included in the training data mixture (see the mixing sketch below):
+
+ - [legacy-datasets/mc4](https://huggingface.co/datasets/legacy-datasets/mc4)
+ - [range3/cc100-ja](https://huggingface.co/datasets/range3/cc100-ja)
+ - [if001/oscar_2023_filtered](https://huggingface.co/datasets/if001/oscar_2023_filtered)
+ - [dumps.wikimedia.org](https://dumps.wikimedia.org/)
+ * Note: this released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.
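As a rough illustration of the 20% English / 80% Japanese mixture described above, the sketch below interleaves streaming datasets with the `datasets` library. The English source, the within-Japanese weights, and the column handling are assumptions; the Wikipedia dump is omitted for brevity, and the llm-jp-corpus filtering step is not reproduced here.

```python
# Sketch: build a 20% English / 80% Japanese streaming mixture with the `datasets` library.
from itertools import islice
from datasets import load_dataset, interleave_datasets

def text_only(dataset):
    # Keep only the raw text field so sources with different schemas interleave cleanly
    # (select_columns on streaming datasets requires a recent `datasets` release).
    return dataset.select_columns(["text"])

japanese = interleave_datasets(
    [
        text_only(load_dataset("legacy-datasets/mc4", "ja", split="train", streaming=True)),
        text_only(load_dataset("range3/cc100-ja", split="train", streaming=True)),
        text_only(load_dataset("if001/oscar_2023_filtered", split="train", streaming=True)),
    ],
    probabilities=[0.5, 0.3, 0.2],  # assumed weights within the Japanese portion
    seed=0,
)
english = text_only(load_dataset("legacy-datasets/mc4", "en", split="train", streaming=True))  # assumed English source

# 20% English / 80% Japanese, as stated in the card.
mixture = interleave_datasets([english, japanese], probabilities=[0.2, 0.8], seed=0)

for example in islice(mixture, 3):
    print(example["text"][:80])
```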
+
+ ## Hyper-parameters
+
+ * **batch_size**: 720
+ * **peak_learning_rate**: 7.5e-5
+ * **optimizer**: AdamW
+ * **weight_decay**: 0.1
+ * **annealing_steps**: 500 (see the optimizer sketch below)
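Below is a rough sketch of how the listed hyper-parameters might map onto a standard PyTorch setup; the dummy model, toy data, and exact shape of the annealing schedule are assumptions, and only the five listed values come from this card.

```python
# Sketch: map the listed hyper-parameters onto a standard PyTorch AdamW setup.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

peak_learning_rate = 7.5e-5   # from the card
weight_decay = 0.1            # from the card
annealing_steps = 500         # from the card
batch_size = 720              # from the card (sequences per optimizer step; not used in this toy loop)

model = torch.nn.Linear(16, 16)  # stand-in for the LLM
optimizer = AdamW(model.parameters(), lr=peak_learning_rate, weight_decay=weight_decay)

# Assumed interpretation of "annealing_steps": decay the learning rate linearly
# to a small value over the final 500 optimizer steps.
annealer = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=annealing_steps)

for step in range(annealing_steps):
    x = torch.randn(4, 16)        # toy batch, not batch_size=720 sequences
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    annealer.step()

print("final lr:", optimizer.param_groups[0]["lr"])
```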
+
+ ## Training Infrastructure
+ The model was trained on a Cerebras Wafer-Scale Cluster, using between 4 and 16 CS-3 systems during different phases of training. Training on Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from the memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale Clusters and Weight Streaming execution [here](https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper.pdf).
+
+ ## Evaluation
+ We conducted a comprehensive evaluation of Llama3-tedllm-8b-v0 and benchmarked it against other leading Japanese-English bilingual models. Considering evaluation results in both Japanese and English, our model performs on par with the best Japanese-English bilingual models of a similar size, while offering significantly higher tokenization efficiency, which leads to a substantial reduction in inference cost (see the tokenizer comparison sketch below).
+
+ - Japanese benchmark: [llm-jp-eval==1.4.1](https://github.com/llm-jp/llm-jp-eval/tree/v1.4.1)
+ - English benchmarks: MMLU, BoolQ, Winogrande, Hellaswag
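The tokenization-efficiency claim can be sanity-checked by comparing token counts on Japanese text between the base Llama-3 tokenizer and the extended tokenizer. A small sketch, with an assumed repository id and a placeholder sample sentence:

```python
# Sketch: compare Japanese tokenization efficiency of the base and extended tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
extended = AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0")  # assumed id

# Placeholder sentence: "Large language models can process Japanese text efficiently."
text = "大規模言語モデルは日本語の文章を効率よく処理できます。"

n_base = len(base.encode(text, add_special_tokens=False))
n_ext = len(extended.encode(text, add_special_tokens=False))
print(f"base tokens: {n_base}, extended tokens: {n_ext}")
print(f"tokens per character: {n_base / len(text):.2f} vs {n_ext / len(text):.2f}")
# Fewer tokens for the same text means cheaper inference on Japanese input.
```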
+
+ ### Japanese Task Results
+
+ |Model|EL|FA|HE|MC|MR|NLI|QA|RC|AVG|
+ |---|---|---|---|---|---|---|---|---|---|
+ | Llama-3-8B | 0.372 | 0.254 | 0.505 | 0.647 | 0.650 | 0.634 | 0.413 | 0.868 | 0.543 |
+ | Llama3-tedllm-8b-v0 | 0.384 | 0.254 | 0.540 | 0.747 | 0.680 | 0.622 | 0.507 | 0.867 | 0.575 |
+ | Llama-3-Swallow-8B-v0.1 | 0.407 | 0.277 | 0.525 | 0.750 | 0.720 | 0.612 | 0.522 | 0.890 | 0.588 |
+ | Llama-3-ELYZA-JP-8B | 0.461 | 0.276 | 0.485 | 0.763 | 0.700 | 0.610 | 0.491 | 0.900 | 0.586 |
+
+ ### English Task Results
+
+ |Model| MMLU | BoolQ | Winogrande | Hellaswag | Average |
+ |---|---|---|---|---|---|
+ | Llama-3-8B | 0.622 | 0.812 | 0.728 | 0.792 | 0.738 |
+ | Llama3-tedllm-8b-v0 | 0.591 | 0.826 | 0.736 | 0.770 | 0.731 |
+ | Llama-3-Swallow-8B-v0.1 | 0.591 | 0.824 | 0.726 | 0.772 | 0.728 |
+ | Llama-3-ELYZA-JP-8B | 0.564 | 0.824 | 0.726 | 0.772 | 0.721 |
+
+ # Model Card Contact
  If you have any questions, please feel free to contact [email protected].