tokyo-electron-device-ai committed on
Commit
a2534a0
1 Parent(s): 9b29313

Update README.md

Files changed (1)
  1. README.md +68 -13
README.md CHANGED
@@ -7,17 +7,12 @@ base_model:
  - meta-llama/Meta-Llama-3-8B
  ---

- ## Model Details
- Llama 3 tedllm is a large language model (8B) built by continual pre-training on the Meta Llama 3 8B model. Llama 3 tedllm was developed to enhance Japanese language capabilities and to handle domain-specific data.
- We use approximately 173 billion tokens from a large Japanese corpus. This model was trained on Cerebras CS-3 wafer-scale systems. Cerebras' weight streaming technology simplifies the training of LLMs by disaggregating compute from model storage. This allowed for efficient scaling of training across nodes using simple data parallelism.
- ## Intended uses & limitations

- You can use the raw model for text generation or fine-tune it on a downstream task.

- ## How to use
-
- You can use this model directly with a pipeline for text generation.
- Here is how to use this model to get the features of a given text in PyTorch:

  ```python
  import torch
@@ -37,14 +32,74 @@ with torch.no_grad():
  print(tokenizer.decode(output))
  ```

- ## Limitations and bias

- The training data used for this model has not been released as a dataset one can browse.

  ## Training data

- The published model is not trained with the domain-specific data. It is trained with a Japanese corpus only, because the domain-specific data is proprietary to us. We do not plan to release models trained with the domain-specific data.

- ## Model Card Contact

  If you have any questions, please feel free to contact [email protected].

  - meta-llama/Meta-Llama-3-8B
  ---

+ # Llama3-tedllm-8B-v0
+ Llama3-tedllm-8b-v0 is a bilingual Japanese-English generative model built through continuous pre-training from Llama-3-8B on approximately 173 billion tokens. It is designed to enhance Japanese language understanding and generation while preserving the English proficiency of Llama-3.

+ # How to use

+ Below is a sample code snippet for using this model for text generation.

  ```python
  import torch
  print(tokenizer.decode(output))
  ```
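Only the first and last lines of the snippet are visible in this diff view. As a complement, here is a minimal self-contained sketch of the full generation flow, assuming the standard Hugging Face `transformers` API; the repository id, prompt, and sampling settings below are assumptions rather than values taken from this card.

```python
# Minimal text-generation sketch (assumptions noted in comments).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id; replace with the actual one if it differs.
model_id = "tokyo-electron-device-ai/llama3-tedllm-8b-v0"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumed dtype; fp16/fp32 also work
    device_map="auto",            # requires the `accelerate` package
)

prompt = "人工知能とは、"  # example Japanese prompt (assumption): "Artificial intelligence is ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```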
 
+ # Model Details
+ * **Developed by**: TED, Cerebras Systems
+ * **Language(s)**: Japanese and English
+ * **Model architecture**: Matches LLaMA-3 8B
+ * **License**: Meta Llama 3 License
+ * **Trained from model**: LLaMA-3 8B
+ * **Vocabulary size**: 141,056 tokens (see the sketch below)
+ * **Context length**: 8192 tokens
+ * **Input**: text data
+ * **Output**: generated text
+
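The vocabulary size and context length listed above can be cross-checked against the released configuration and tokenizer files. A minimal sketch, assuming the standard `transformers` API and the assumed repository id from the usage example above:

```python
# Sketch: verify vocabulary size and context length from the released files.
from transformers import AutoConfig, AutoTokenizer

model_id = "tokyo-electron-device-ai/llama3-tedllm-8b-v0"  # assumed repository id

config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

print("embedding rows (config.vocab_size):", config.vocab_size)                      # expected 141056
print("tokenizer vocabulary size:", len(tokenizer))                                  # expected ~141056
print("context length (max_position_embeddings):", config.max_position_embeddings)  # expected 8192
```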
+ # Intended Use & Limitations
+
+ * **Intended Use**: This model is continuously pretrained using Llama-3-8B as the foundation. The model has not been exposed to instruction tuning data. It is designed for text generation tasks and can also be fine-tuned for specific downstream applications, making it suitable for a variety of users, including researchers, developers, and businesses.
+ * **Limitations**: Despite its versatility, the model is trained on web-crawled datasets like mc4 and OSCAR, which may contain inaccuracies, biases, or harmful content. As a result, it can generate incorrect, misleading, or offensive outputs. Users should critically evaluate the results, especially in high-stakes or sensitive applications, and are responsible for ensuring compliance with legal and ethical standards. This model is a tool, not a source of truth, and its outputs should be verified in context.

+ # Training Details
+ ## Training process
+ We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.
+
+ - Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, to add a wider variety of Japanese kanji tokens. This improves Japanese tokenization efficiency by 21%.
+ - We initialize newly added embeddings using similarity-based token embedding initialization: each added embedding vector is initialized with a weighted average of the embeddings of the top-K most similar tokens in the original LLaMA-3 vocabulary, where similarity is computed with an external embedding model (see the sketch after this list).
+ - We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except for the embedding and unembedding layers.
+ - This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
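To make the similarity-based initialization and the embedding-only freezing step concrete, here is an illustrative sketch. The encoder choice, the value of K, the placeholder new tokens, and the softmax weighting are assumptions, not details taken from this card.

```python
# Illustrative sketch of similarity-based initialization for newly added token
# embeddings, followed by the embedding-only freezing step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

base_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float32)
tokenizer = AutoTokenizer.from_pretrained(base_id)

# 1) Extend the vocabulary with new tokens (placeholder examples).
new_tokens = ["新しい語彙", "追加トークン"]
old_vocab_size = model.get_input_embeddings().weight.shape[0]
tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))

# 2) Initialize each new embedding as a weighted average of the embeddings of the
#    top-K most similar original tokens, with similarity taken from an external
#    embedding model (here an assumed sentence-transformers encoder).
K = 5
encoder = SentenceTransformer("intfloat/multilingual-e5-base")  # assumed choice
orig_tokens = [tokenizer.convert_ids_to_tokens(i) for i in range(old_vocab_size)]
orig_ext = torch.tensor(encoder.encode(orig_tokens, normalize_embeddings=True))  # slow; kept simple for clarity

emb = model.get_input_embeddings().weight.data
with torch.no_grad():
    for tok in new_tokens:
        new_ext = torch.tensor(encoder.encode([tok], normalize_embeddings=True))[0]
        sims = orig_ext @ new_ext                    # cosine similarity (vectors are normalized)
        topk = torch.topk(sims, K)
        weights = torch.softmax(topk.values, dim=0)  # assumed weighting scheme
        emb[tokenizer.convert_tokens_to_ids(tok)] = (weights[:, None] * emb[topk.indices]).sum(dim=0)
        # A full version would also initialize the matching rows of the unembedding (lm_head).

# 3) Embedding-only training: freeze everything except the embedding and
#    unembedding layers, then run the usual training loop on the 8.6B-token subset.
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)
```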
 
  ## Training data
+ This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus). The following Japanese datasets were included in the training data mixture (see the mixing sketch below):
+
+ - [legacy-datasets/mc4](https://huggingface.co/datasets/legacy-datasets/mc4)
+ - [range3/cc100-ja](https://huggingface.co/datasets/range3/cc100-ja)
+ - [if001/oscar_2023_filtered](https://huggingface.co/datasets/if001/oscar_2023_filtered)
+ - [dumps.wikimedia.org](https://dumps.wikimedia.org/)
+ * Note: this released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.
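As a rough illustration of the 20% English / 80% Japanese mixture described above, the sketch below interleaves streaming datasets with the `datasets` library. The English source, the within-Japanese weights, and the column handling are assumptions; the Wikipedia dump is omitted for brevity, and the llm-jp-corpus filtering step is not reproduced here.

```python
# Sketch: build a 20% English / 80% Japanese streaming mixture with the `datasets` library.
from itertools import islice
from datasets import load_dataset, interleave_datasets

def text_only(dataset):
    # Keep only the raw text field so sources with different schemas interleave cleanly
    # (select_columns on streaming datasets requires a recent `datasets` release).
    return dataset.select_columns(["text"])

japanese = interleave_datasets(
    [
        text_only(load_dataset("legacy-datasets/mc4", "ja", split="train", streaming=True)),
        text_only(load_dataset("range3/cc100-ja", split="train", streaming=True)),
        text_only(load_dataset("if001/oscar_2023_filtered", split="train", streaming=True)),
    ],
    probabilities=[0.5, 0.3, 0.2],  # assumed weights within the Japanese portion
    seed=0,
)
english = text_only(load_dataset("legacy-datasets/mc4", "en", split="train", streaming=True))  # assumed English source

# 20% English / 80% Japanese, as stated in the card.
mixture = interleave_datasets([english, japanese], probabilities=[0.2, 0.8], seed=0)

for example in islice(mixture, 3):
    print(example["text"][:80])
```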
+
+ ## Hyper-parameters
+
+ * **batch_size**: 720
+ * **peak_learning_rate**: 7.5e-5
+ * **optimizer**: AdamW
+ * **weight_decay**: 0.1
+ * **annealing_steps**: 500 (see the optimizer sketch below)
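Below is a rough sketch of how the listed hyper-parameters might map onto a standard PyTorch setup; the dummy model, toy data, and exact shape of the annealing schedule are assumptions, and only the five listed values come from this card.

```python
# Sketch: map the listed hyper-parameters onto a standard PyTorch AdamW setup.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR

peak_learning_rate = 7.5e-5   # from the card
weight_decay = 0.1            # from the card
annealing_steps = 500         # from the card
batch_size = 720              # from the card (sequences per optimizer step; not used in this toy loop)

model = torch.nn.Linear(16, 16)  # stand-in for the LLM
optimizer = AdamW(model.parameters(), lr=peak_learning_rate, weight_decay=weight_decay)

# Assumed interpretation of "annealing_steps": decay the learning rate linearly
# to a small value over the final 500 optimizer steps.
annealer = LinearLR(optimizer, start_factor=1.0, end_factor=0.1, total_iters=annealing_steps)

for step in range(annealing_steps):
    x = torch.randn(4, 16)        # toy batch, not batch_size=720 sequences
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    annealer.step()

print("final lr:", optimizer.param_groups[0]["lr"])
```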
+
+ ## Training Infrastructure
+ The model was trained on a Cerebras Wafer-Scale Cluster, using between 4 and 16 CS-3 systems during different phases of training. Training on Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from the memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale Clusters and Weight Streaming execution [here](https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper.pdf).
+
+ ## Evaluation
+ We conducted a comprehensive evaluation of Llama3-tedllm-8b-v0 and benchmarked it against other leading Japanese-English bilingual models. Considering evaluation results in both Japanese and English, our model performs on par with the best Japanese-English bilingual models of a similar size, while offering significantly higher tokenization efficiency, which leads to a substantial reduction in inference cost (see the tokenizer comparison sketch below).
+
+ - Japanese benchmark: [llm-jp-eval==1.4.1](https://github.com/llm-jp/llm-jp-eval/tree/v1.4.1)
+ - English benchmarks: MMLU, BoolQ, Winogrande, Hellaswag
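The tokenization-efficiency claim can be sanity-checked by comparing token counts on Japanese text between the base Llama-3 tokenizer and the extended tokenizer. A small sketch, with an assumed repository id and a placeholder sample sentence:

```python
# Sketch: compare Japanese tokenization efficiency of the base and extended tokenizers.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
extended = AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0")  # assumed id

# Placeholder sentence: "Large language models can process Japanese text efficiently."
text = "大規模言語モデルは日本語の文章を効率よく処理できます。"

n_base = len(base.encode(text, add_special_tokens=False))
n_ext = len(extended.encode(text, add_special_tokens=False))
print(f"base tokens: {n_base}, extended tokens: {n_ext}")
print(f"tokens per character: {n_base / len(text):.2f} vs {n_ext / len(text):.2f}")
# Fewer tokens for the same text means cheaper inference on Japanese input.
```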
+
+ ### Japanese Task Results
+
+ |Model|EL|FA|HE|MC|MR|NLI|QA|RC|AVG|
+ |---|---|---|---|---|---|---|---|---|---|
+ | Llama-3-8B | 0.372 | 0.254 | 0.505 | 0.647 | 0.650 | 0.634 | 0.413 | 0.868 | 0.543 |
+ | Llama3-tedllm-8b-v0 | 0.384 | 0.254 | 0.540 | 0.747 | 0.680 | 0.622 | 0.507 | 0.867 | 0.575 |
+ | Llama-3-Swallow-8B-v0.1 | 0.407 | 0.277 | 0.525 | 0.750 | 0.720 | 0.612 | 0.522 | 0.890 | 0.588 |
+ | Llama-3-ELYZA-JP-8B | 0.461 | 0.276 | 0.485 | 0.763 | 0.700 | 0.610 | 0.491 | 0.900 | 0.586 |
+
+ ### English Task Results
+
+ |Model| MMLU | BoolQ | Winogrande | Hellaswag | Average |
+ |---|---|---|---|---|---|
+ | Llama-3-8B | 0.622 | 0.812 | 0.728 | 0.792 | 0.738 |
+ | Llama3-tedllm-8b-v0 | 0.591 | 0.826 | 0.736 | 0.770 | 0.731 |
+ | Llama-3-Swallow-8B-v0.1 | 0.591 | 0.824 | 0.726 | 0.772 | 0.728 |
+ | Llama-3-ELYZA-JP-8B | 0.564 | 0.824 | 0.726 | 0.772 | 0.721 |
+
+ # Model Card Contact
  If you have any questions, please feel free to contact [email protected].