license: llama3
language:
- ja
- en
base_model:
- meta-llama/Meta-Llama-3-8B
Llama3-tedllm-8B-v0
Llama3-tedllm-8b-v0 is a bilingual Japanese-English generative model built through continuous pre-training of Llama-3-8B on approximately 173 billion tokens. It is designed to enhance Japanese language understanding and generation while preserving the English proficiency of Llama-3.
How to use
Below is sample code that uses this model for text generation.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0")
model = AutoModelForCausalLM.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0", device_map="auto", torch_dtype=torch.bfloat16)

text = "人工知能とは何か説明してください"  # "Please explain what artificial intelligence is."
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=50,  # generate at most 50 tokens beyond the prompt
        do_sample=True,     # nucleus sampling with the settings below
        top_p=0.9,
        temperature=0.6,
    )[0]
print(tokenizer.decode(output))
Model Details
- Developed by: TED, Cerebras Systems
- Language(s): Japanese and English
- Model architecture: Matches LLaMA-3 8B
- License: Meta Llama 3 License
- Trained from model: Llama-3-8B
- Vocabulary size: 141,056 tokens
- Context length: 8192 tokens
- Input: text data
- Output: model generates text
Intended Use & Limitations
- Intended Use: This model is continuously pretrained using Llama-3-8B as the foundation. The model has not been exposed to instruction tuning data. It is designed for text generation tasks and can also be fine-tuned for specific downstream applications, making it suitable for a variety of users, including researchers, developers, and businesses.
- Limitations: Despite its versatility, the model is trained on web-crawled datasets like mc4 and OSCAR, which may contain inaccuracies, biases, or harmful content. As a result, it can generate incorrect, misleading, or offensive outputs. Users should critically evaluate the results, especially in high-stakes or sensitive applications, and are responsible for ensuring compliance with legal and ethical standards. This model is a tool, not a source of truth, and its outputs should be verified in context.
Training Details
Training process
We follow the training approach described in "Bilingual Adaptation of Monolingual Foundation Models".
- Starting with the Llama-3-8B base checkpoint, we extend the LLaMA-3 vocabulary by 10%, from 128,000 to 141,056 tokens, to cover a wider variety of Japanese kanji tokens. This improves Japanese tokenization efficiency by 21%.
- We initialize the newly added embeddings using similarity-based token embedding initialization: each added embedding vector is a weighted average of the embeddings of the top-K most similar tokens in the original LLaMA-3 vocabulary, with similarity determined using an external embedding.
- We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except for the embedding and unembedding layers.
- This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
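The similarity-based initialization step can be illustrated with a minimal NumPy sketch. This is not the training code used for this model: the function name `init_new_embeddings`, the choice of cosine similarity, the softmax weighting, and the default `k=5` are all assumptions made for illustration, since the card does not specify them.

```python
import numpy as np


def init_new_embeddings(old_emb, old_ext, new_ext, k=5):
    """Initialize embeddings of added tokens from the K most similar old tokens.

    old_emb: (V_old, d) model embeddings of the original vocabulary.
    old_ext: (V_old, e) external-embedding vectors of the original tokens.
    new_ext: (V_new, e) external-embedding vectors of the added tokens.
    Returns a (V_new, d) array of initial embeddings for the added tokens.
    """
    # Cosine similarity in the external embedding space (assumed metric).
    old_n = old_ext / np.linalg.norm(old_ext, axis=1, keepdims=True)
    new_n = new_ext / np.linalg.norm(new_ext, axis=1, keepdims=True)
    sim = new_n @ old_n.T  # shape (V_new, V_old)

    # Indices and similarities of the top-K original tokens per added token.
    top_idx = np.argsort(-sim, axis=1)[:, :k]
    top_sim = np.take_along_axis(sim, top_idx, axis=1)

    # Softmax over the K similarities gives the averaging weights (assumed).
    w = np.exp(top_sim - top_sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # Weighted average of the corresponding model embeddings.
    return np.einsum("nk,nkd->nd", w, old_emb[top_idx])
```

With `k=1` the function simply copies the embedding of the single most similar original token; larger `k` blends several neighbours, which is the behaviour the weighted average in the description suggests.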
Training data
This model was continuously pre-trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the llm-jp-corpus repository. The following Japanese datasets were included in the training data mixture:
- [legacy-datasets/mc4](https://huggingface.co/datasets/legacy-datasets/mc4)
- [range3/cc100-ja](https://huggingface.co/datasets/range3/cc100-ja)
- [if001/oscar_2023_filtered](https://huggingface.co/datasets/if001/oscar_2023_filtered)
- [dumps.wikimedia.org](https://dumps.wikimedia.org/)
- Note: the released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.
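The token counts quoted in this card can be cross-checked with a few lines of arithmetic. The sketch below assumes the 80/20 language split applies uniformly to the combined token budget of both training stages; the per-language totals it derives are estimates, not figures reported by the authors.

```python
# Token budgets of the two training stages, as stated in this card.
EMBEDDING_ONLY_TOKENS = 8_600_000_000
FULL_PRETRAINING_TOKENS = 164_000_000_000
JAPANESE_SHARE = 0.80  # remaining 20% is English

total_tokens = EMBEDDING_ONLY_TOKENS + FULL_PRETRAINING_TOKENS

# Assumption: the language mix applies to the whole budget.
japanese_tokens = total_tokens * JAPANESE_SHARE
english_tokens = total_tokens - japanese_tokens

print(f"total:    {total_tokens / 1e9:.1f}B tokens")
print(f"Japanese: {japanese_tokens / 1e9:.1f}B tokens")
print(f"English:  {english_tokens / 1e9:.1f}B tokens")
```

The two stages sum to 172.6B tokens, which is consistent with the rounded 173B figure.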
Hyper-parameters
- batch_size: 720
- peak_learning_rate: 7.5e-5
- optimizer: AdamW
- weight_decay: 0.1
- annealing_steps: 500
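For a rough sense of scale, the batch size can be related to the token budgets of the two training stages. The sketch below rests on two assumptions that this card does not confirm: that `batch_size` counts sequences, and that each sequence is packed to the full 8192-token context length. The resulting step counts are therefore estimates only.

```python
BATCH_SIZE = 720       # from the hyper-parameter list above
CONTEXT_LENGTH = 8192  # from Model Details

# Assumption: batch_size counts sequences, each packed to the full context.
TOKENS_PER_STEP = BATCH_SIZE * CONTEXT_LENGTH


def approx_steps(num_tokens: float) -> int:
    """Rough number of optimizer steps needed to consume `num_tokens`."""
    return round(num_tokens / TOKENS_PER_STEP)


for stage, tokens in [("embedding-only", 8.6e9), ("full pre-training", 164e9)]:
    print(f"{stage}: ~{approx_steps(tokens):,} steps")
```

If the actual batches were shorter or not fully packed, the real step counts would be higher than these estimates.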
Training Infrastructure
The model was trained on a Cerebras Wafer-Scale Cluster, using between 4 and 16 CS-3 systems during different phases of training. Training on Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from the memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale clusters and Weight Streaming execution here.
Evaluation
We conducted a comprehensive evaluation of Llama3-tedllm-8b-v0 and benchmarked it against other leading Japanese-English bilingual models. Considering evaluation results in both Japanese and English, our model performs on par with the best Japanese-English bilingual models of a similar size, while offering significantly higher tokenization efficiency, which leads to a substantial reduction in inference cost.
- Japanese benchmark: llm-jp-eval==1.4.1
- English benchmark: MMLU, BoolQ, Winogrande, Hellaswag
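The tokenization efficiency mentioned above can be measured in several ways; this card does not state which definition was used. One plausible sketch, assuming efficiency is compared as tokens per character on the same texts, is shown below. The helper names `tokens_per_char` and `token_reduction` are hypothetical.

```python
from typing import Callable, Iterable, List, Sequence


def tokens_per_char(tokenize: Callable[[str], Sequence], texts: Iterable[str]) -> float:
    """Average number of tokens produced per input character."""
    texts = list(texts)
    n_chars = sum(len(t) for t in texts)
    if n_chars == 0:
        raise ValueError("texts must contain at least one character")
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_chars


def token_reduction(
    base: Callable[[str], Sequence],
    extended: Callable[[str], Sequence],
    texts: Iterable[str],
) -> float:
    """Fractional reduction in token count of `extended` relative to `base`."""
    texts: List[str] = list(texts)
    return 1.0 - tokens_per_char(extended, texts) / tokens_per_char(base, texts)
```

To compare real tokenizers, the `tokenize` method of a Hugging Face tokenizer can be passed in directly, for example `AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0").tokenize` as `extended` and the tokenizer of `meta-llama/Meta-Llama-3-8B` as `base`. Fewer tokens for the same text means fewer decoding steps, which is where the inference cost reduction comes from.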
Japanese Task Results
Model | EL | FA | HE | MC | MR | NLI | QA | RC | AVG |
---|---|---|---|---|---|---|---|---|---|
Llama-3-8B | 0.372 | 0.254 | 0.505 | 0.647 | 0.650 | 0.634 | 0.413 | 0.868 | 0.543 |
Llama3-tedllm-8b-v0 | 0.384 | 0.254 | 0.540 | 0.747 | 0.680 | 0.622 | 0.507 | 0.867 | 0.575 |
Llama-3-Swallow-8B-v0.1 | 0.407 | 0.277 | 0.525 | 0.750 | 0.720 | 0.612 | 0.522 | 0.890 | 0.588 |
Llama-3-ELYZA-JP-8B | 0.461 | 0.276 | 0.485 | 0.763 | 0.700 | 0.610 | 0.491 | 0.900 | 0.586 |
English Task Results
Model | MMLU | BoolQ | Winogrande | Hellaswag | Average |
---|---|---|---|---|---|
Llama-3-8B | 0.622 | 0.812 | 0.728 | 0.792 | 0.738 |
Llama3-tedllm-8b-v0 | 0.591 | 0.826 | 0.736 | 0.770 | 0.731 |
Llama-3-Swallow-8B-v0.1 | 0.591 | 0.824 | 0.726 | 0.772 | 0.728 |
Llama-3-ELYZA-JP-8B | 0.564 | 0.824 | 0.726 | 0.772 | 0.721 |
Model Card Contact
If you have any questions, please feel free to contact [email protected].