---
license: llama3
language:
- ja
- en
base_model:
- meta-llama/Meta-Llama-3-8B
---

# Llama3-tedllm-8B-v0

Llama3-tedllm-8b-v0 is a bilingual Japanese-English generative model built through continuous pre-training of [Llama-3-8B](https://huggingface.co/meta-llama/Meta-Llama-3-8B) on approximately 173 billion tokens. It is designed to enhance Japanese language understanding and generation while preserving the English proficiency of Llama-3.

# How to use

The sample code below uses this model for text generation.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model in bfloat16, placing weights automatically
tokenizer = AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0")
model = AutoModelForCausalLM.from_pretrained(
    "tokyo-electron-device-ai/llama3-tedllm-8b-v0",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Prompt: "Please explain what artificial intelligence is."
text = "人工知能とは何か説明してください"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample up to 50 new tokens with nucleus sampling
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.6,
    )[0]

print(tokenizer.decode(output))
```

# Model Details

* **Developed by**: TED, Cerebras Systems
* **Language(s)**: Japanese and English
* **Model architecture**: Matches Llama-3 8B
* **License**: Meta Llama 3 License
* **Trained from model**: Llama-3 8B
* **Vocabulary size**: 141,056 tokens
* **Context length**: 8192 tokens
* **Input**: text data
* **Output**: generated text

# Intended Use & Limitations

* **Intended Use**: This model was continuously pre-trained from Llama-3-8B and has not been exposed to instruction-tuning data. It is designed for text generation tasks and can also be fine-tuned for specific downstream applications, making it suitable for a variety of users, including researchers, developers, and businesses.
* **Limitations**: Despite its versatility, the model was trained on web-crawled datasets such as mC4 and OSCAR, which may contain inaccuracies, biases, or harmful content. As a result, it can generate incorrect, misleading, or offensive outputs. Users should critically evaluate the results, especially in high-stakes or sensitive applications, and are responsible for ensuring compliance with legal and ethical standards. This model is a tool, not a source of truth, and its outputs should be verified in context.

# Training Details

### Training process

We follow the approach described in [Bilingual Adaptation of Monolingual Foundation Models](https://arxiv.org/abs/2407.12869) for training.

- Starting with the Llama-3-8B base checkpoint, we extend the Llama-3 vocabulary by about 10%, from 128,256 to 141,056 tokens, to cover a wider variety of Japanese kanji tokens. This improves Japanese tokenization efficiency by 21%.
- We initialize the newly added embeddings using similarity-based token embedding initialization. Each added embedding vector is initialized with a weighted average of the embeddings of the top K most similar tokens in the original Llama-3 vocabulary, where similarity is computed using an external embedding.
- We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except for the embedding and unembedding layers.
- This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.

### Training data

This model was continuously pre-trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the [llm-jp-corpus repository](https://github.com/llm-jp/llm-jp-corpus).
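
The similarity-based initialization step listed under Training process can be illustrated with a small, dependency-free sketch. This is not the training code used for this model: it assumes cosine similarity in the external embedding space and softmax-normalized weights over the top K neighbours, neither of which is specified in this model card, and the function and variable names (`init_new_embedding`, `old_ext_vecs`, `old_model_embs`) are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def init_new_embedding(new_ext_vec, old_ext_vecs, old_model_embs, k=2):
    """Initialize the embedding of one newly added token.

    new_ext_vec:    external-embedding vector of the new token
    old_ext_vecs:   external-embedding vectors of the original vocabulary
    old_model_embs: model embedding rows of the original vocabulary
    k:              number of most similar original tokens to average

    Returns a weighted average of the model embeddings of the k original
    tokens closest to the new token in the external embedding space.
    """
    sims = [cosine(new_ext_vec, v) for v in old_ext_vecs]
    # Indices of the k most similar original tokens
    top = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    # Assumption: softmax over the top-k similarities gives the weights
    exps = [math.exp(sims[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(old_model_embs[0])
    return [
        sum(w * old_model_embs[i][d] for w, i in zip(weights, top))
        for d in range(dim)
    ]


# Toy data: 3 original tokens, 2-d external space, 3-d model embeddings
old_ext_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
old_model_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# The new token sits midway between the first two original tokens
new_vec = init_new_embedding([1.0, 1.0], old_ext_vecs, old_model_embs, k=2)
print(new_vec)
```

In a real setup the same averaging would be applied to every added row of both the embedding and unembedding matrices, so that new tokens start close to related existing tokens rather than from random values.
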
The following Japanese datasets were included in the training data mixture:

* **[legacy-datasets/mc4](https://huggingface.co/datasets/legacy-datasets/mc4)**
* **[range3/cc100-ja](https://huggingface.co/datasets/range3/cc100-ja)**
* **[if001/oscar_2023_filtered](https://huggingface.co/datasets/if001/oscar_2023_filtered)**
* **[dumps.wikimedia.org](https://dumps.wikimedia.org/)**

Note: This released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.

### Hyper-parameters

* **batch_size**: 720
* **peak_learning_rate**: 7.5e-5
* **optimizer**: AdamW
* **weight_decay**: 0.1
* **annealing_steps**: 500

Note: We created a separate model, llama3-tedllm-8b-v0-annealing, to which the annealing steps are applied. If you are interested, please check [here](https://huggingface.co/tokyo-electron-device-ai/llama3-tedllm-8b-v0-annealing).

### Training Infrastructure

The model was trained on a Cerebras Wafer-Scale Cluster, using between 4 and 16 CS-3 systems during different phases of training. Training on Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from the memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale Clusters and Weight Streaming execution [here](https://8968533.fs1.hubspotusercontent-na1.net/hubfs/8968533/Virtual%20Booth%20Docs/CS%20Weight%20Streaming%20White%20Paper.pdf).

### Evaluation

We conducted a comprehensive evaluation of [Llama3-tedllm-8b-v0-annealing](https://huggingface.co/tokyo-electron-device-ai/llama3-tedllm-8b-v0-annealing) and benchmarked it against other leading Japanese-English bilingual models.
Considering evaluation results in both Japanese and English, our model performs on par with the best Japanese-English bilingual models of a similar size, while offering significantly higher tokenization efficiency, which leads to a substantial reduction in inference cost.

- Japanese benchmark: [llm-jp-eval==1.4.1](https://github.com/llm-jp/llm-jp-eval/tree/v1.4.1)
- English benchmark: MMLU, BoolQ, Winogrande, Hellaswag

#### Japanese Task Result

| Model | EL | FA | HE | MC | MR | NLI | QA | RC | AVG |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3-8B | 0.372 | 0.254 | 0.505 | 0.647 | 0.650 | 0.634 | 0.413 | 0.868 | 0.543 |
| Llama3-tedllm-8b-v0 | 0.384 | 0.254 | 0.540 | 0.747 | 0.680 | 0.622 | 0.507 | 0.867 | 0.575 |
| Llama-3-Swallow-8B-v0.1 | 0.407 | 0.277 | 0.525 | 0.750 | 0.720 | 0.612 | 0.522 | 0.890 | 0.588 |
| Llama-3-ELYZA-JP-8B | 0.461 | 0.276 | 0.485 | 0.763 | 0.700 | 0.610 | 0.491 | 0.900 | 0.586 |

#### English Task Result

| Model | MMLU | BoolQ | Winogrande | Hellaswag | Average |
|---|---|---|---|---|---|
| Llama-3-8B | 0.622 | 0.812 | 0.728 | 0.792 | 0.738 |
| Llama3-tedllm-8b-v0 | 0.591 | 0.826 | 0.736 | 0.770 | 0.731 |
| Llama-3-Swallow-8B-v0.1 | 0.591 | 0.824 | 0.726 | 0.772 | 0.728 |
| Llama-3-ELYZA-JP-8B | 0.564 | 0.824 | 0.726 | 0.772 | 0.721 |

# Model Card Contact

If you have any questions, please feel free to contact cerebras-sup@teldevice.co.jp.