---
license: apache-2.0
language:
- ja
base_model:
- Qwen/Qwen2-7B
pipeline_tag: text-generation
library_name: transformers
---
# Moriyasu_Qwen2_JP_7B

## Model Description
Moriyasu_Qwen2_JP_7B is a large language model trained by Moriyasu. Based on Qwen/Qwen2-7B, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.
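Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, the model can presumably be loaded with the standard `transformers` API. The snippet below is a minimal sketch; the repository id `Moriyasu/Moriyasu_Qwen2_JP_7B` and the presence of a chat template (standard for Qwen2 derivatives) are assumptions, not confirmed by this card.

```python
# Minimal generation sketch for an instruction-tuned Qwen2-based model.
# Assumptions: the repo id below and the availability of a chat template
# are not confirmed by this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Moriyasu/Moriyasu_Qwen2_JP_7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "What is the tallest mountain in Japan?"
    {"role": "user", "content": "日本で一番高い山は何ですか？"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```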
## Training Datasets

### Pre-training dataset
The model is continually pre-trained on Japanese data from the Qwen2-7B base model while maintaining its English ability (80% Japanese, 20% English). We use about 120 billion tokens sampled from Japanese and English Wikipedia articles, Japanese CC-100, Japanese C4, Japanese OSCAR, The Pile, Webfined, Japanese websites, book data, mathematics, and code.
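A mixture like the 80/20 split above could be assembled with `datasets.interleave_datasets`; the sketch below is illustrative only, and the corpus paths passed to `load_dataset` are placeholders, not the actual training sources.

```python
# Illustrative sketch of an 80% Japanese / 20% English sampling mixture
# using Hugging Face datasets. Dataset paths are placeholders; the card
# does not specify how the actual mixture was built.
from datasets import load_dataset, interleave_datasets

ja_corpus = load_dataset("path/to/japanese-corpus", split="train", streaming=True)  # placeholder
en_corpus = load_dataset("path/to/english-corpus", split="train", streaming=True)   # placeholder

# Sample documents at the stated 80/20 ratio.
mixture = interleave_datasets(
    [ja_corpus, en_corpus],
    probabilities=[0.8, 0.2],
    seed=42,
)
```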
### Instruction Tuning
We created about 1 million instruction examples through a mix of methods, including synthetic generation, translation, and manual annotation by humans.
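For illustration, instruction data of this kind is commonly stored as JSONL records with instruction/output fields; the schema below is hypothetical and not taken from this card.

```python
# Hypothetical instruction-tuning record, shown for illustration only;
# the card does not document the actual data schema.
import json

example = {
    # "Summarize the following text in one sentence."
    "instruction": "次の文章を一文に要約してください。",
    "input": "...",           # optional context passage (placeholder)
    "output": "...",          # target response (placeholder)
    "source": "translated",   # e.g. generated / translated / human-annotated
}
print(json.dumps(example, ensure_ascii=False))
```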
## Model Performance

### JGLUE tasks
We used the lm-evaluation-harness repository to evaluate across 8 tasks. The results are as follows (a reproduction sketch appears after the table):
| Model | JCommonsenseQA<br>(3-shot, Acc.) | JNLI<br>(3-shot, Balanced Acc.) | JMARC<br>(0-shot, Balanced Acc.) | JSQuAD<br>(2-shot, Char-F1) | JAQKET-V2<br>(1-shot, Char-F1) | XL-SUM<br>(1-shot, ROUGE-2) | XWINOGRAD<br>(0-shot, Acc.) | MGSM<br>(5-shot, Acc.) | JA AVG |
|---|---|---|---|---|---|---|---|---|---|
| Moriyasu_Qwen2_JP_7B (OURS) | 94.91 | 91.11 | 95.50 | 87.48 | 89.24 | 19.66 | 82.38 | 55.60 | 76.99 |
| Qwen2-7B-Instruct | 90.80 | 78.07 | 93.29 | 92.90 | 83.34 | 19.05 | 72.16 | 61.20 | 73.85 |
| SakanaAI/EvoLLM-JP-v1-7B | 89.19 | 66.02 | 95.55 | 92.10 | 86.41 | 23.31 | 81.65 | 47.60 | 72.73 |
| Llama-3-ELYZA-JP-8B | 92.40 | 64.85 | 95.67 | 92.04 | 87.43 | 21.35 | 78.21 | 49.20 | 72.64 |
| Llama-3-Swallow-8B-Instruct-v0.1 | 92.49 | 62.12 | 94.27 | 93.73 | 90.83 | 19.61 | 74.04 | 50.00 | 72.14 |
| Tanuki-8B-dpo-v1.0 | 79.18 | 43.05 | 92.26 | 82.29 | 77.99 | 11.68 | 70.39 | 43.60 | 62.56 |
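As a rough guide to reproducing this setup, the sketch below uses the `simple_evaluate` entry point from EleutherAI's lm-evaluation-harness (v0.4+ Python API). The task identifiers are assumptions; Japanese tasks such as JCommonsenseQA and JSQuAD ship under fork- and version-specific names, so check them against the installed harness before running.

```python
# Rough reproduction sketch using lm-evaluation-harness (v0.4+ Python API).
# Task names are placeholders: the Japanese tasks in the table ship under
# fork/version-specific identifiers, so verify them before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Moriyasu/Moriyasu_Qwen2_JP_7B,dtype=bfloat16",  # assumed repo id
    tasks=["jcommonsenseqa", "jnli", "jsquad", "mgsm"],  # placeholder task ids
    num_fewshot=None,  # per-task shot counts as listed in the table above
    batch_size="auto",
)
print(results["results"])
```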