---
license: apache-2.0
language:
- ja
base_model:
- Qwen/Qwen2-7B
pipeline_tag: text-generation
library_name: transformers
---
# Moriyasu_Qwen2_JP_7B

## Model Description
Moriyasu_Qwen2_JP_7B is a large language model trained by Moriyasu. Based on Qwen/Qwen2-7B, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.
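Since the card declares `library_name: transformers` and `pipeline_tag: text-generation`, the model can presumably be loaded with the standard `transformers` API. The snippet below is a minimal sketch; the repository id `Moriyasu/Moriyasu_Qwen2_JP_7B` and the presence of a chat template (standard for Qwen2 derivatives) are assumptions, not confirmed by this card.

```python
# Minimal generation sketch for an instruction-tuned Qwen2-based model.
# Assumptions: the repo id below and the availability of a chat template
# are not confirmed by this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Moriyasu/Moriyasu_Qwen2_JP_7B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    # "What is the tallest mountain in Japan?"
    {"role": "user", "content": "日本で一番高い山は何ですか？"}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```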
## Training Datasets

### Pre-training dataset
The model is continually pre-trained on Japanese data from the Qwen2-7B base model while maintaining its English ability (80% Japanese, 20% English). We use about 120 billion tokens sampled from Japanese and English Wikipedia articles, Japanese CC-100, Japanese C4, Japanese OSCAR, The Pile, Webfined, Japanese websites, book data, mathematics, and code.
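A mixture like the 80/20 split above could be assembled with `datasets.interleave_datasets`; the sketch below is illustrative only, and the corpus paths passed to `load_dataset` are placeholders, not the actual training sources.

```python
# Illustrative sketch of an 80% Japanese / 20% English sampling mixture
# using Hugging Face datasets. Dataset paths are placeholders; the card
# does not specify how the actual mixture was built.
from datasets import load_dataset, interleave_datasets

ja_corpus = load_dataset("path/to/japanese-corpus", split="train", streaming=True)  # placeholder
en_corpus = load_dataset("path/to/english-corpus", split="train", streaming=True)   # placeholder

# Sample documents at the stated 80/20 ratio.
mixture = interleave_datasets(
    [ja_corpus, en_corpus],
    probabilities=[0.8, 0.2],
    seed=42,
)
```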
### Instruction Tuning
We created about 1 million instruction examples through a mix of methods, including synthetic generation, translation, and manual annotation by humans.
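For illustration, instruction data of this kind is commonly stored as JSONL records with instruction/output fields; the schema below is hypothetical and not taken from this card.

```python
# Hypothetical instruction-tuning record, shown for illustration only;
# the card does not document the actual data schema.
import json

example = {
    # "Summarize the following text in one sentence."
    "instruction": "次の文章を一文に要約してください。",
    "input": "...",           # optional context passage (placeholder)
    "output": "...",          # target response (placeholder)
    "source": "translated",   # e.g. generated / translated / human-annotated
}
print(json.dumps(example, ensure_ascii=False))
```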
## Model Performance

### JGLUE tasks
We used the lm-evaluation-harness repository to evaluate across 8 tasks. The results are as follows (a reproduction sketch appears after the table):
| Model | JCommonsenseQA<br>(3-shot, Acc.) | JNLI<br>(3-shot, Balanced Acc.) | JMARC<br>(0-shot, Balanced Acc.) | JSQuAD<br>(2-shot, Char-F1) | JAQKET-V2<br>(1-shot, Char-F1) | XL-SUM<br>(1-shot, ROUGE-2) | XWINOGRAD<br>(0-shot, Acc.) | MGSM<br>(5-shot, Acc.) | JA AVG |
|---|---|---|---|---|---|---|---|---|---|
| Moriyasu_Qwen2_JP_7B (OURS) | 94.91 | 91.11 | 95.50 | 87.48 | 89.24 | 19.66 | 82.38 | 55.60 | 76.99 |
| Qwen2-7B-Instruct | 90.80 | 78.07 | 93.29 | 92.90 | 83.34 | 19.05 | 72.16 | 61.20 | 73.85 |
| SakanaAI/EvoLLM-JP-v1-7B | 89.19 | 66.02 | 95.55 | 92.10 | 86.41 | 23.31 | 81.65 | 47.60 | 72.73 |
| Llama-3-ELYZA-JP-8B | 92.40 | 64.85 | 95.67 | 92.04 | 87.43 | 21.35 | 78.21 | 49.20 | 72.64 |
| Llama-3-Swallow-8B-Instruct-v0.1 | 92.49 | 62.12 | 94.27 | 93.73 | 90.83 | 19.61 | 74.04 | 50.00 | 72.14 |
| Tanuki-8B-dpo-v1.0 | 79.18 | 43.05 | 92.26 | 82.29 | 77.99 | 11.68 | 70.39 | 43.60 | 62.56 |
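As a rough guide to reproducing this setup, the sketch below uses the `simple_evaluate` entry point from EleutherAI's lm-evaluation-harness (v0.4+ Python API). The task identifiers are assumptions; Japanese tasks such as JCommonsenseQA and JSQuAD ship under fork- and version-specific names, so check them against the installed harness before running.

```python
# Rough reproduction sketch using lm-evaluation-harness (v0.4+ Python API).
# Task names are placeholders: the Japanese tasks in the table ship under
# fork/version-specific identifiers, so verify them before running.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Moriyasu/Moriyasu_Qwen2_JP_7B,dtype=bfloat16",  # assumed repo id
    tasks=["jcommonsenseqa", "jnli", "jsquad", "mgsm"],  # placeholder task ids
    num_fewshot=None,  # per-task shot counts as listed in the table above
    batch_size="auto",
)
print(results["results"])
```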