license: apache-2.0
language:
  - ja
pipeline_tag: text-generation
library_name: transformers

Moriyasu_Qwen2_JP_7B

Model Description

Moriyasu_Qwen2_JP_7B is a large language model trained by Moriyasu. Based on Qwen/Qwen2-7B, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.

Model Performance

JGLUE tasks

We used the lm-evaluation-harness repo to evaluate across 8 tasks, and the results are as follows:

| Model | JCommonsenseQA | JNLI | JMARC | JSQuAD | JAQKET-V2 | XL-SUM | XWINOGRAD | MGSM | JA AVG |
|---|---|---|---|---|---|---|---|---|---|
| | 3-shot | 3-shot | 0-shot | 2-shot | 1-shot | 1-shot | 0-shot | 5-shot | |
| | Acc. | Balanced Acc. | Balanced Acc. | Char-F1 | Char-F1 | ROUGE-2 | Acc. | Acc. | |
| Moriyasu_Qwen2_JP_7B (ours) | 0.9491 | 0.9111 | 0.9550 | 0.8748 | 0.8924 | 0.1966 | 0.8238 | 0.5560 | 0.7699 |
| Qwen2-7B-Instruct | 0.9080 | 0.7807 | 0.9329 | 0.9290 | 0.8334 | 0.1905 | 0.7216 | 0.6120 | 0.7385 |
| SakanaAI/EvoLLM-JP-v1-7B | 0.8919 | 0.6602 | 0.9555 | 0.9210 | 0.8641 | 0.2331 | 0.8165 | 0.4760 | 0.7273 |
| Llama-3-ELYZA-JP-8B | 0.9240 | 0.6485 | 0.9567 | 0.9204 | 0.8743 | 0.2135 | 0.7821 | 0.4920 | 0.7264 |
| Llama-3-Swallow-8B-Instruct-v0.1 | 0.9249 | 0.6212 | 0.9427 | 0.9373 | 0.9083 | 0.1961 | 0.7404 | 0.5000 | 0.7214 |
| Tanuki-8B-dpo-v1.0 | 0.7918 | 0.4305 | 0.9226 | 0.8229 | 0.7799 | 0.1168 | 0.7039 | 0.4360 | 0.6256 |
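
The JA AVG column is consistent with an unweighted mean of the eight task scores. The sketch below checks this for two rows of the table. The helper name and the rounding tolerance are illustrative, and the assumption that the average was computed this way is ours rather than something the evaluation repo states.

```python
import math
from statistics import mean

# Scores copied from the table above, in column order
# (JCommonsenseQA ... MGSM), paired with the reported JA AVG.
rows = {
    "Moriyasu_Qwen2_JP_7B": (
        [0.9491, 0.9111, 0.9550, 0.8748, 0.8924, 0.1966, 0.8238, 0.5560],
        0.7699,
    ),
    "Qwen2-7B-Instruct": (
        [0.9080, 0.7807, 0.9329, 0.9290, 0.8334, 0.1905, 0.7216, 0.6120],
        0.7385,
    ),
}


def unweighted_average(scores):
    """Return the plain arithmetic mean of the task scores."""
    return mean(scores)


for name, (scores, reported) in rows.items():
    avg = unweighted_average(scores)
    # Assumption: the reported value is rounded to four decimals,
    # so we compare with a small absolute tolerance.
    matches = math.isclose(avg, reported, abs_tol=1e-4)
    print(f"{name}: matches reported average = {matches}")
```

Because every task carries equal weight, low-scale metrics such as ROUGE-2 pull the average down by the same amount for every model, so the average is more useful for ranking than as an absolute score.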

Japanese tasks

For this evaluation, we used the swallow-evaluation repo. Results for the other models are taken from the Llama-3.1-Swallow-8B-Instruct-v0.2 report.

| Model | JCom. | JEMHopQA | NIILC | JSQuAD | XL-Sum | MGSM | WMT20-en-ja | WMT20-ja-en | JMMLU | JHumanEval | Ja Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 4-shot | 4-shot | 4-shot | 4-shot | 1-shot | 4-shot | 4-shot | 4-shot | 5-shot | 0-shot | |
| | EM acc | Char-F1 | Char-F1 | Char-F1 | ROUGE-2 | EM acc | BLEU | BLEU | EM acc | pass@1 | |
| Moriyasu_Qwen2_JP_7B (ours) | 0.9321 | 0.4823 | 0.6046 | 0.9201 | 0.1382 | 0.5560 | 0.2636 | 0.1892 | 0.5273 | 0.2976 | 0.4911 |
| RakutenAI-7B-chat | 0.9035 | 0.2600 | 0.4619 | 0.8647 | 0.1339 | 0.2120 | 0.2667 | 0.1966 | 0.4504 | 0.2299 | 0.3980 |
| Qwen2-7B-Instruct | 0.8856 | 0.3902 | 0.3859 | 0.8967 | 0.1277 | 0.5720 | 0.2041 | 0.1909 | 0.5713 | 0.5683 | 0.4793 |
| Qwen2.5-7B-Instruct | 0.9151 | 0.4293 | 0.3910 | 0.8908 | 0.1676 | 0.6240 | 0.2108 | 0.1916 | 0.6252 | 0.5305 | 0.4976 |
| Tanuki-8B-dpo-v1.0 | 0.2770 | 0.2937 | 0.3710 | 0.6669 | 0.1016 | 0.4280 | 0.2385 | 0.1820 | 0.3078 | 0.2555 | 0.3122 |
| Llama 3 8B Instruct | 0.8785 | 0.3812 | 0.3936 | 0.8955 | 0.1273 | 0.4160 | 0.2143 | 0.2035 | 0.4719 | 0.2872 | 0.4269 |
| Llama 3.1 8B Instruct | 0.8829 | 0.4272 | 0.4112 | 0.8856 | 0.1481 | 0.5280 | 0.2174 | 0.1990 | 0.5086 | 0.4976 | 0.4706 |
| Llama 3 Youko 8B Instruct | 0.9196 | 0.4850 | 0.5178 | 0.9001 | 0.2085 | 0.4680 | 0.2559 | 0.1906 | 0.4691 | 0.2695 | 0.4684 |
| Llama-3-ELYZA-JP-8B | 0.9017 | 0.5124 | 0.5016 | 0.9113 | 0.1677 | 0.4600 | 0.2509 | 0.1846 | 0.4829 | 0.3811 | 0.4754 |
| Llama 3 heron brain 8B v0.3 | 0.9231 | 0.4933 | 0.5694 | 0.9056 | 0.2178 | 0.4560 | 0.2771 | 0.2168 | 0.4993 | 0.3177 | 0.4876 |
| Llama 3 Swallow 8B Instruct | 0.9178 | 0.4963 | 0.5168 | 0.9088 | 0.1296 | 0.4880 | 0.2522 | 0.2254 | 0.4835 | 0.3927 | 0.4811 |
| Llama 3.1 Swallow 8B Instruct v0.1 | 0.9240 | 0.5874 | 0.5736 | 0.9170 | 0.1380 | 0.5080 | 0.2820 | 0.2282 | 0.5301 | 0.3665 | 0.5055 |
| Llama 3.1 Swallow 8B Instruct v0.2 | 0.9294 | 0.5601 | 0.5988 | 0.9148 | 0.1372 | 0.5280 | 0.2878 | 0.2270 | 0.5504 | 0.4079 | 0.5141 |
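
Several columns above report Char-F1. As a rough illustration of what a character-level F1 measures, the sketch below scores a prediction against a reference by the overlap of their characters. This is a generic definition chosen for illustration; the evaluation repos may compute the metric differently (for example via a fuzzy string-matching ratio), so the function `char_f1` here is hypothetical and not the harness's implementation.

```python
from collections import Counter


def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 based on multiset overlap of characters."""
    if not prediction or not reference:
        # Convention used in this sketch: 1.0 only if both strings are empty.
        return float(prediction == reference)
    # Counter intersection keeps the minimum count of each shared character.
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# An exact match and a prediction with two extra characters.
print(char_f1("富士山", "富士山"))
print(char_f1("富士山です", "富士山"))
```

Unlike exact match, a score of this kind gives partial credit when the prediction contains the reference plus extra characters, which matters for Japanese answers that often end with a polite suffix.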

Japanese MTBench

For this evaluation, we used FastChat, with gpt-4o-2024-08-06 serving as the judge and generating the reference answers.

Due to limited computational resources, we conducted evaluations on only a select number of models.

| Model | coding | extraction | humanities | math | reasoning | roleplay | stem | writing | JMTAvg |
|---|---|---|---|---|---|---|---|---|---|
| Moriyasu_Qwen2_JP_7B (ours) | 0.515 | 0.710 | 0.845 | 0.685 | 0.585 | 0.815 | 0.710 | 0.765 | 0.704 |
| Llama-3-ELYZA-JP-8B | 0.365 | 0.720 | 0.730 | 0.400 | 0.555 | 0.670 | 0.580 | 0.785 | 0.601 |
| Llama 3.1 Swallow 8B Instruct v0.1 | 0.480 | 0.680 | 0.705 | 0.475 | 0.425 | 0.710 | 0.620 | 0.645 | 0.592 |

Elyza task 100

For this benchmark, we used the Elyza task 100 dataset and Elyza's GPT-4o scoring prompt. The prompt is linked from this blog.

| Model | Score |
|---|---|
| Moriyasu_Qwen2_JP_7B (ours) | 3.37 |
| Llama-3-ELYZA-JP-8B | 3.66 |
| Llama 3.1 Swallow 8B Instruct v0.1 | 3.32 |

Nejumi leaderboard 3

We will contact Nejumi soon to have the model evaluated on this benchmark.

Usage

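import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model in bfloat16 and place it automatically on the available devices.
path = 'AIJapanese/Moriyasu_Qwen2_JP_7B'
model = AutoModelForCausalLM.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    use_cache=True
)
tokenizer = AutoTokenizer.from_pretrained(path)

# System prompt: "You are a sincere and excellent Japanese assistant.
# Always strive to provide the most helpful answer possible."
system_prompt = "あなたは誠実で優秀な日本人アシスタントです。常に可能な限り最も役立つ回答を提供するように努めてください。"
# User prompt: "What is the highest mountain in Japan?"
prompt = "日本で一番高い山は何ですか"
conversation = [{"role": "system", "content": system_prompt}]
conversation.append({"role": "user", "content": prompt})

# Render the conversation with the chat template, ending with the assistant turn marker.
text = tokenizer.apply_chat_template(
    conversation,
    tokenize=False,
    add_generation_prompt=True)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)
generated_ids = model.generate(
    **model_inputs,  # passes both input_ids and attention_mask
    max_new_tokens=2048,
    do_sample=True,  # sampling must be enabled for temperature to take effect
    temperature=0.2,
    #top_p=0.95,
    #top_k=40,
)
# Drop the prompt tokens so that only the newly generated tokens are decoded.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)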
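
The example above is single-turn. For a multi-turn chat, earlier exchanges can be added to the message list before the new user prompt. The helper below is a minimal sketch of that idea: `build_conversation` is a hypothetical name, the assistant reply in the history is made up for illustration, and it assumes the chat template accepts the usual `system`, `user`, and `assistant` roles.

```python
from typing import Dict, List, Tuple


def build_conversation(
    system_prompt: str,
    history: List[Tuple[str, str]],
    user_prompt: str,
) -> List[Dict[str, str]]:
    """Build a message list from prior (user, assistant) turns plus a new prompt."""
    conversation = [{"role": "system", "content": system_prompt}]
    for user_text, assistant_text in history:
        conversation.append({"role": "user", "content": user_text})
        conversation.append({"role": "assistant", "content": assistant_text})
    # The new user message goes last so the model answers it next.
    conversation.append({"role": "user", "content": user_prompt})
    return conversation


conversation = build_conversation(
    system_prompt="あなたは誠実で優秀な日本人アシスタントです。",
    # Illustrative history: "What is the highest mountain in Japan?" / "It is Mt. Fuji."
    history=[("日本で一番高い山は何ですか", "富士山です。")],
    # New prompt: "Please tell me its elevation."
    user_prompt="その標高を教えてください。",
)
print([message["role"] for message in conversation])
```

The resulting list can be passed to `tokenizer.apply_chat_template` in place of the `conversation` variable used above.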

Training Datasets

Pre-training dataset

The model was continually pre-trained from Qwen2-7B on a mix of 80% Japanese and 20% English data, which strengthens its Japanese while preserving its English ability. We used about 120 billion tokens sampled from Japanese and English Wikipedia articles, Japanese CC-100, Japanese C4, Japanese OSCAR, The Pile, Webfined, Japanese websites, book data, mathematics, code, and other sources.
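
To make the mix concrete, the sketch below splits the token budget by language. It assumes that the 80/20 ratio refers to token counts and treats the approximate total of 120 billion as exact; the function name `split_token_budget` is hypothetical.

```python
def split_token_budget(total_tokens: int, japanese_share: float) -> dict:
    """Split a token budget between Japanese and English by share."""
    japanese = round(total_tokens * japanese_share)
    # English receives whatever remains, so the two parts always sum to the total.
    return {"japanese": japanese, "english": total_tokens - japanese}


# Assumption: about 120 billion tokens in total, 80% of them Japanese.
budget = split_token_budget(total_tokens=120_000_000_000, japanese_share=0.8)
print(budget)
```

Under these assumptions the budget works out to roughly 96 billion Japanese tokens and 24 billion English tokens.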

Instruction Tuning

We built about 1 million instruction examples using several methods, including synthetic generation, translation, and manual annotation by humans.

Contact

If you have any questions, please contact me at: [email protected]