JiuZhang-8B Model
JiuZhang-8B is a math-specific base model obtained by continued pre-training of Llama3-8B on 140B tokens, of which 100B are math-related.
Features
- Excellent reasoning performance: JiuZhang-8B achieves accuracy comparable to the Qwen2.5-Math-7B base model on evaluation sets covering four types of math problems (GSM8K, MATH, GAOKAO, and ZHONGKAO), and surpasses general base models with 70B+ parameters such as Llama-3.1-70B and Qwen2-72B.
- Good general capabilities: JiuZhang-8B scores 0.622 on MMLU, on par with its Llama3-8B base, so general performance on tasks beyond math reasoning is preserved.
- Self-correction ability: JiuZhang-8B can check and correct errors in its own reasoning process, a result of training on a large proportion of synthetic data. The model has not undergone any post-training, so it can be instruction-tuned or format-trained as needed; a sketch follows below.
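Because the released checkpoint is a pure base model, a standard supervised fine-tuning loop can adapt it to an instruction format. The sketch below is illustrative only: the dataset file `math_sft.jsonl`, its `question`/`answer` fields, the `path/to/JiuZhang-8B` path, and all hyperparameters are assumptions, not a recipe from the authors.

```python
# Hypothetical SFT sketch; dataset path, field names, and hyperparameters
# are placeholders, not the authors' recipe.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM, AutoTokenizer,
    DataCollatorForLanguageModeling, Trainer, TrainingArguments,
)

model_path = "path/to/JiuZhang-8B"  # local path or Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # Llama3 tokenizers ship without a pad token
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")

# Reuse the prompt format from the Quick Start section below.
def to_text(example):
    return {"text": (f"Solve the following problem step by step. "
                     f"Question: {example['question']}\nSolution: {example['answer']}")}

dataset = load_dataset("json", data_files="math_sft.jsonl")["train"].map(to_text)
dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="jiuzhang-sft",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=dataset,
    # Causal-LM collator: pads each batch and copies input_ids to labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```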
Quick Start
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/JiuZhang-8B"  # local path or Hugging Face repo id
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

example = "Find $x$ such that $\\lceil x \\rceil + x = \\dfrac{23}{7}$. Express $x$ as a common fraction."
prompt = f"Solve the following problem step by step. Question: {example}\nSolution:"
model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

# Greedy decoding via `do_sample=False` (`temperature=0` is not a valid
# generate argument); `max_new_tokens` bounds only the generated continuation.
generated_ids = model.generate(
    model_inputs.input_ids, do_sample=False, max_new_tokens=2048
)
# Strip the prompt tokens so only the generated solution remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
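The performance comparison in the next section uses both zero-shot and few-shot settings. A few-shot prompt simply prepends worked exemplars in the same format; the sketch below is a hypothetical illustration (the exemplar problem and solution are placeholders, not items from the evaluation sets), reusing `example` from the snippet above.

```python
# Hypothetical few-shot prompt construction; the exemplar below is a
# placeholder, not an item from any evaluation set.
exemplars = [
    ("Compute $1+1$.", "Adding the two numbers gives $1+1=2$. The answer is $2$."),
]
few_shot_prefix = "".join(
    f"Solve the following problem step by step. Question: {q}\nSolution: {s}\n\n"
    for q, s in exemplars
)
# `example` is the test question from the Quick Start snippet above.
prompt = few_shot_prefix + (
    f"Solve the following problem step by step. Question: {example}\nSolution:"
)
```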
Performance
We compare JiuZhang-8B with popular models on evaluation sets covering four types of math problems: GSM8K, MATH, GAOKAO, and ZHONGKAO.
- All results in the table below use greedy decoding; for each dataset we report the better accuracy of the zero-shot and few-shot settings.
- A comparison model is used to judge whether each generated answer matches the reference answer.
- We also report the arithmetic mean of the four dataset accuracies (reproduced in the sketch after the table).
| Model | GSM8K | MATH | GAOKAO | ZHONGKAO | Average |
|---|---|---|---|---|---|
| General models | | | | | |
| Meta-Llama-3-8B | 58.38 | 17.04 | 13.62 | 42.61 | 32.91 |
| Meta-Llama-3-70B | 82.34 | 38.42 | 28.09 | 64.02 | 53.21 |
| Meta-Llama-3.1-8B | 56.79 | 19.70 | 11.49 | 44.70 | 33.17 |
| Meta-Llama-3.1-70B | 81.73 | 39.66 | 31.06 | 64.77 | 54.31 |
| Qwen2-7B | 80.44 | 47.82 | 27.23 | 70.45 | 56.49 |
| Qwen2-72B | 86.58 | 56.88 | 45.11 | 73.67 | 65.56 |
| Qwen2.5-7B | 84.61 | 53.22 | 45.53 | 80.30 | 65.92 |
| Qwen2.5-72B | 90.60 | 59.38 | 56.60 | 82.95 | 72.38 |
| Math-specific models | | | | | |
| Llemma-7B | 41.47 | 18.94 | 14.89 | 45.08 | 30.10 |
| Deepseek-Math-7B-Base | 65.73 | 33.40 | 23.83 | 62.69 | 46.41 |
| Qwen2-Math-7B | 80.67 | 53.02 | 42.13 | 77.08 | 63.22 |
| Qwen2-Math-72B | 88.63 | 61.88 | 51.91 | 81.25 | 70.92 |
| Qwen2.5-Math-7B | 85.44 | 59.10 | 53.19 | 78.79 | 69.13 |
| Qwen2.5-Math-72B | 88.70 | 67.10 | 53.62 | 81.63 | 72.76 |
| JiuZhang-8B | 81.20 | 60.38 | 60.43 | 80.49 | 70.62 |
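As a sanity check, the sketch below recomputes JiuZhang-8B's reported average from the table. The naive string match is only a stand-in for the answer-judging step, since the card does not specify the comparison model.

```python
# Recompute JiuZhang-8B's reported average from the table above.
jiuzhang = {"GSM8K": 81.20, "MATH": 60.38, "GAOKAO": 60.43, "ZHONGKAO": 80.49}
average = sum(jiuzhang.values()) / len(jiuzhang)
print(f"{average:.2f}")  # 70.62, matching the Average column

# Naive normalized string match as a stand-in for the (unspecified)
# comparison model used to judge answers.
def is_correct(prediction: str, reference: str) -> bool:
    normalize = lambda s: s.strip().rstrip(".").replace(" ", "").lower()
    return normalize(prediction) == normalize(reference)
```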
Acknowledgements
Thanks to all contributors who helped develop this model.