Llama3-tedllm-8B-v0

Llama3-tedllm-8b-v0 is a bilingual Japanese-English generative model built through continuous pre-training from Llama-3-8B on approximately 173 billion tokens. The model is designed to improve Japanese language understanding and generation while preserving the English proficiency of Llama-3.

How to use

Below is sample code that uses this model for text generation.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model (bfloat16 weights, placed automatically on available devices).
tokenizer = AutoTokenizer.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0")
model = AutoModelForCausalLM.from_pretrained("tokyo-electron-device-ai/llama3-tedllm-8b-v0", device_map="auto", torch_dtype=torch.bfloat16)

# Prompt: "Please explain what artificial intelligence is"
text = "人工知能とは何か説明してください"
tokenized_input = tokenizer.encode(text, add_special_tokens=False, return_tensors="pt").to(model.device)

# Sample up to 50 new tokens with nucleus (top-p) sampling.
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=50,
        do_sample=True,
        top_p=0.9,
        temperature=0.6,
    )[0]

# The decoded string contains the prompt followed by the generated continuation.
print(tokenizer.decode(output))

Model Details

Developed by: TED, Cerebras Systems
Language(s): Japanese and English
Model architecture: Matches Llama-3 8B
License: Meta Llama 3 License
Trained from model: Llama-3 8B
Vocabulary size: 141,056 tokens
Context length: 8192 tokens
Input: text data
Output: model generates text

Intended Use & Limitations

Intended Use: This model was produced by continuous pre-training from Llama-3-8B and has not been exposed to instruction-tuning data. It is designed for text generation tasks and can also be fine-tuned for specific downstream applications, making it suitable for a variety of users, including researchers, developers, and businesses.
Limitations: The model was trained on web-crawled datasets such as mC4 and OSCAR, which may contain inaccuracies, biases, or harmful content. As a result, it can generate incorrect, misleading, or offensive outputs. Users should critically evaluate the results, especially in high-stakes or sensitive applications, and are responsible for ensuring compliance with legal and ethical standards. This model is a tool, not a source of truth, and its outputs should be verified in context.
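
Because the model has not been instruction-tuned, completion-style prompts with a few worked examples are often a more reliable way to steer a base model than a bare instruction. The following is a minimal sketch of such a prompt builder. The helper name build_few_shot_prompt, the "質問"/"回答" labels, and the example pairs are illustrative assumptions, not part of the released model or its recommended usage.

```python
def build_few_shot_prompt(examples, query, q_label="質問", a_label="回答"):
    """Format (question, answer) pairs, then leave the final answer open
    so that a base model continues the pattern."""
    blocks = [f"{q_label}: {q}\n{a_label}: {a}" for q, a in examples]
    blocks.append(f"{q_label}: {query}\n{a_label}:")
    return "\n\n".join(blocks)


# Hypothetical example pairs, chosen only to illustrate the format.
examples = [
    ("日本の首都はどこですか", "東京です。"),
    ("富士山の高さは何メートルですか", "3776メートルです。"),
]
prompt = build_few_shot_prompt(examples, "人工知能とは何か説明してください")
print(prompt)
```

The resulting string can be passed as text in the sample code under "How to use"; generation then continues from the open answer slot.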

Training Details

Training process

For training, we follow the approach described in "Bilingual Adaptation of Monolingual Foundation Models".

Starting with the Llama-3-8B base checkpoint, we extend the Llama-3 vocabulary by 10%, from 128,000 to 141,056 tokens, to cover a wider variety of Japanese kanji tokens. This improves Japanese tokenization efficiency by 21%.
We initialize the newly added embeddings using similarity-based token embedding initialization. Each added embedding vector is initialized with a weighted average of the embeddings of the top K most similar tokens in the original Llama-3 vocabulary, where similarity is measured using an external embedding.
We start with embedding-only training on 8.6B tokens, freezing the weights of all layers except for the embedding and unembedding layers.
This is followed by full continuous pre-training on 164B tokens, where all model weights are updated.
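
The similarity-based initialization step above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the training code: the function name init_new_embedding, the use of cosine similarity, the softmax weighting, the value of K, and the toy vectors are all assumptions, since the text only specifies a weighted average over the top K most similar tokens.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_new_embedding(new_ext_vec, old_ext_vecs, old_model_embs, k=2):
    """Initialize a new token's model embedding as a weighted average of the
    model embeddings of the k old tokens closest to it in the external space."""
    sims = [(cosine(new_ext_vec, v), i) for i, v in enumerate(old_ext_vecs)]
    top = sorted(sims, reverse=True)[:k]
    # Assumption: softmax over similarities as the weighting scheme.
    max_sim = max(s for s, _ in top)
    exps = [math.exp(s - max_sim) for s, _ in top]
    weights = [e / sum(exps) for e in exps]
    dim = len(old_model_embs[0])
    return [
        sum(w * old_model_embs[i][d] for w, (_, i) in zip(weights, top))
        for d in range(dim)
    ]


# Toy data: three old tokens with 2-d external vectors and 3-d model embeddings.
old_ext_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
old_model_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
new_embedding = init_new_embedding([1.0, 1.0], old_ext_vecs, old_model_embs, k=2)
print(new_embedding)
```

In this toy case the new token is equally similar to the first two old tokens and dissimilar to the third, so its initial embedding is an even blend of the first two.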

Training data

This model was continuously trained on 173B tokens, with the training data consisting of 20% English and 80% Japanese. The raw Japanese data was filtered using scripts from the llm-jp-corpus repository. The following Japanese datasets were included in the training data mixture:

Note: This released model was trained exclusively on open-source datasets. We also trained models using proprietary domain-specific data, but there are no plans to release those models.
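
As a quick consistency check on the token counts quoted in this card, the two training phases can be added up and split by language. This sketch assumes the 80%/20% split is measured in tokens and applies to the combined total; the card does not state either point explicitly, so the per-language figures are rough estimates only.

```python
EMBEDDING_ONLY_TOKENS = 8.6e9   # embedding-only phase
FULL_PRETRAIN_TOKENS = 164e9    # full continuous pre-training phase
JAPANESE_SHARE = 0.80           # assumption: share measured in tokens

total_tokens = EMBEDDING_ONLY_TOKENS + FULL_PRETRAIN_TOKENS
japanese_tokens = total_tokens * JAPANESE_SHARE
english_tokens = total_tokens - japanese_tokens

print(f"total:    {total_tokens / 1e9:.1f}B tokens")
print(f"Japanese: {japanese_tokens / 1e9:.1f}B tokens (estimate)")
print(f"English:  {english_tokens / 1e9:.1f}B tokens (estimate)")
```

The two phases sum to the roughly 173B tokens stated above.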

Hyper-parameters

batch_size: 720
peak_learning_rate: 7.5e-5
optimizer: AdamW
weight_decay: 0.1
annealing_steps: 500

Note: We also released a second model, llama3-tedllm-8b-v0-annealing, which has the annealing steps applied. If you are interested, please see that model's page.
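
To give a rough sense of scale for these settings, the sketch below estimates the tokens processed per optimizer step and the number of steps in the full pre-training phase. It rests on two assumptions that the card does not confirm: that batch_size counts sequences, and that every training sequence is packed to the full 8192-token context length. Whether the 500 annealing steps are part of the 164B-token phase or additional to it is also not stated.

```python
import math

BATCH_SIZE = 720               # from the hyper-parameter list
SEQ_LEN = 8192                 # assumption: sequences packed to the full context length
FULL_PRETRAIN_TOKENS = 164e9   # full continuous pre-training phase
ANNEALING_STEPS = 500          # from the hyper-parameter list

tokens_per_step = BATCH_SIZE * SEQ_LEN
estimated_steps = math.ceil(FULL_PRETRAIN_TOKENS / tokens_per_step)
annealing_tokens = ANNEALING_STEPS * tokens_per_step

print(f"tokens per step:          {tokens_per_step:,}")
print(f"estimated full-phase steps: {estimated_steps:,}")
print(f"tokens seen in annealing: {annealing_tokens:,}")
```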

Training Infrastructure

The model was trained on a Cerebras Wafer-Scale Cluster, using between 4 and 16 CS-3 systems during different phases of training. Training on Cerebras Wafer-Scale Clusters leverages Cerebras' Weight Streaming execution paradigm, which simplifies the training of LLMs by disaggregating compute from the memory used for model weights. This enables efficient scaling of training across multiple nodes using simple data parallelism. You can learn more about Cerebras Wafer-Scale clusters and Weight Streaming execution here.
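
The idea behind data parallelism in general can be shown with a toy example: each worker computes the gradient on its own equal-sized shard of the batch, and the shard gradients are averaged. The sketch below is generic and says nothing about how Cerebras implements Weight Streaming; the one-parameter model, the function names, and the data are assumptions made for illustration.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the one-parameter model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute one gradient per shard
    (one per worker), and average the results."""
    assert len(xs) % num_workers == 0, "batch must divide evenly across workers"
    shard = len(xs) // num_workers
    grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return sum(grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.5

full_batch = grad_mse(w, xs, ys)
sharded = data_parallel_grad(w, xs, ys, num_workers=2)
print(full_batch, sharded)
```

With equal-sized shards and a mean-reduced loss, the averaged shard gradients match the full-batch gradient, which is why adding workers changes throughput but not the update itself.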

Evaluation

We conducted a comprehensive evaluation of Llama3-tedllm-8b-v0-annealing and benchmarked it against other leading Japanese-English bilingual models. Considering evaluation results in both Japanese and English, our model performs on par with the best Japanese-English bilingual models of a similar size, while offering significantly higher tokenization efficiency, which leads to a substantial reduction in inference cost.

Japanese benchmark: llm-jp-eval==1.4.1
English benchmark: MMLU, BoolQ, Winogrande, Hellaswag

Japanese Task Result

Model	EL	FA	HE	MC	MR	NLI	QA	RC	AVG
Llama-3-8B	0.372	0.254	0.505	0.647	0.650	0.634	0.413	0.868	0.543
Llama3-tedllm-8b-v0	0.384	0.254	0.540	0.747	0.680	0.622	0.507	0.867	0.575
Llama-3-Swallow-8B-v0.1	0.407	0.277	0.525	0.750	0.720	0.612	0.522	0.890	0.588
Llama-3-ELYZA-JP-8B	0.461	0.276	0.485	0.763	0.700	0.610	0.491	0.900	0.586
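
The AVG column can be reproduced from the per-category scores. The check below assumes AVG is the unweighted mean of the eight category scores, which is consistent with the reported values to within rounding; the card does not state the aggregation rule explicitly.

```python
# Category order: EL, FA, HE, MC, MR, NLI, QA, RC (copied from the table above).
scores = {
    "Llama-3-8B": [0.372, 0.254, 0.505, 0.647, 0.650, 0.634, 0.413, 0.868],
    "Llama3-tedllm-8b-v0": [0.384, 0.254, 0.540, 0.747, 0.680, 0.622, 0.507, 0.867],
    "Llama-3-Swallow-8B-v0.1": [0.407, 0.277, 0.525, 0.750, 0.720, 0.612, 0.522, 0.890],
    "Llama-3-ELYZA-JP-8B": [0.461, 0.276, 0.485, 0.763, 0.700, 0.610, 0.491, 0.900],
}
reported_avg = {
    "Llama-3-8B": 0.543,
    "Llama3-tedllm-8b-v0": 0.575,
    "Llama-3-Swallow-8B-v0.1": 0.588,
    "Llama-3-ELYZA-JP-8B": 0.586,
}

# Assumption: AVG is the unweighted mean over the eight categories.
computed_avg = {model: sum(vals) / len(vals) for model, vals in scores.items()}

for model, mean in computed_avg.items():
    print(f"{model}: computed {mean:.4f}, reported {reported_avg[model]:.3f}")
```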

English Task Result

Model	MMLU	BoolQ	Winogrande	Hellaswag	Average
Llama-3-8B	0.622	0.812	0.728	0.792	0.738
Llama3-tedllm-8b-v0	0.591	0.826	0.736	0.770	0.731
Llama-3-Swallow-8B-v0.1	0.591	0.824	0.726	0.772	0.728
Llama-3-ELYZA-JP-8B	0.564	0.824	0.726	0.772	0.721

Model Card Contact

If you have any questions, please feel free to contact [email protected].
