JiuZhang-8B Model

JiuZhang-8B is a math-specialized base model obtained by continued pre-training of Llama3-8B on 140B tokens, of which 100B are math-related.

Features

  • Excellent reasoning performance: JiuZhang-8B achieves accuracy comparable to the Qwen2.5-Math-7B base model on four math evaluation sets (GSM8K, MATH, GAOKAO, and ZHONGKAO), and surpasses 70B-scale base models such as Llama-3.1-70B and Qwen2-72B.

  • Good general capabilities: JiuZhang-8B scores 0.622 on MMLU, on par with its base model Llama3-8B, so general performance on tasks other than math reasoning is preserved.

  • Self-correction ability: JiuZhang-8B can check and correct errors in its own reasoning process, a behavior that emerges from the large proportion of synthetic data used in pre-training. The model has not undergone any post-training, so it can be instruction fine-tuned or format-trained as needed (see the sketch below).
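Since the checkpoint is a pure base model, a standard supervised fine-tuning run is one way to add instruction following. The sketch below is a minimal illustration using the Hugging Face Trainer; the checkpoint path, dataset file, field names, and hyperparameters are assumptions for illustration, not part of the official release.

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
)
from datasets import load_dataset

model_path = "path/to/JiuZhang-8B"  # assumed local checkpoint path
tokenizer = AutoTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="auto")

# Hypothetical instruction dataset with "question" and "solution" fields.
dataset = load_dataset("json", data_files="instructions.jsonl", split="train")

def tokenize(batch):
    # Reuse the same Question/Solution format as the Quick Start prompt.
    texts = [
        f"Solve the following problem step by step. Question: {q}\nSolution: {s}{tokenizer.eos_token}"
        for q, s in zip(batch["question"], batch["solution"])
    ]
    return tokenizer(texts, truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="jiuzhang-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()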

Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/JiuZhang-8B"  # local or Hub path to the checkpoint
device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

example = "Find $x$ such that $\\lceil x \\rceil + x = \\dfrac{23}{7}$. Express $x$ as a common fraction."
prompt = f"Solve the following problem step by step. Question: {example}\nSolution:"

model_inputs = tokenizer([prompt], return_tensors="pt").to(device)

# Greedy decoding: pass do_sample=False rather than temperature=0,
# which recent transformers versions reject.
generated_ids = model.generate(
    **model_inputs, do_sample=False, max_new_tokens=2048
)
# Strip the prompt tokens so only the generated solution remains.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
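If the model emits its final result inside a \boxed{...} expression, a common convention for models trained on MATH-style data, the answer can be pulled out of the solution text. The helper below is an illustrative assumption, not part of the official release:

# Illustrative helper (an assumption, not part of the model release): extract
# the contents of the last \boxed{...} expression, balancing nested braces so
# answers like \boxed{\frac{9}{7}} survive intact.
def extract_boxed(text: str):
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i = start + len("\\boxed{")
    begin, depth = i, 1
    while i < len(text) and depth > 0:
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        i += 1
    return text[begin:i - 1] if depth == 0 else None

print(extract_boxed(response))  # e.g. "\\frac{9}{7}" for the example above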

Performance

We compared JiuZhang-8B with popular models on the evaluation sets of four types of math problems: GSM8K, MATH, GAOKAO, and ZHONGKAO.

  • All results in the table below use greedy decoding; for each dataset we report the better accuracy of the zero-shot and few-shot settings (a sketch of few-shot prompt construction follows the table).
  • We use a comparison model to judge whether the generated answers match the reference answers.
  • We also report the arithmetic mean of the accuracies over the four datasets; for example, JiuZhang-8B's average is (81.20 + 60.38 + 60.43 + 80.49) / 4 ≈ 70.62.
Model                   GSM8K   MATH    GAOKAO  ZHONGKAO  Average

General models
Meta-Llama-3-8B         58.38   17.04   13.62   42.61     32.91
Meta-Llama-3-70B        82.34   38.42   28.09   64.02     53.21
Meta-Llama-3.1-8B       56.79   19.70   11.49   44.70     33.17
Meta-Llama-3.1-70B      81.73   39.66   31.06   64.77     54.31
Qwen2-7B                80.44   47.82   27.23   70.45     56.49
Qwen2-72B               86.58   56.88   45.11   73.67     65.56
Qwen2.5-7B              84.61   53.22   45.53   80.30     65.92
Qwen2.5-72B             90.60   59.38   56.60   82.95     72.38

Math-specific models
Llemma-7B               41.47   18.94   14.89   45.08     30.10
Deepseek-Math-7B-Base   65.73   33.40   23.83   62.69     46.41
Qwen2-Math-7B           80.67   53.02   42.13   77.08     63.22
Qwen2-Math-72B          88.63   61.88   51.91   81.25     70.92
Qwen2.5-Math-7B         85.44   59.10   53.19   78.79     69.13
Qwen2.5-Math-72B        88.70   67.10   53.62   81.63     72.76
JiuZhang-8B             81.20   60.38   60.43   80.49     70.62
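The few-shot setting mentioned above prepends worked examples in the same Question/Solution format as the Quick Start prompt. The sketch below is a hedged illustration; the demonstration problem is invented, not taken from the actual evaluation sets.

# Hedged illustration of few-shot prompting: prepend demonstrations in the
# same format, then append the target question (the `example` variable from
# the Quick Start). The demonstration below is invented for illustration.
few_shot = [
    ("What is $2 + 3$?",
     "Adding the two integers gives $2 + 3 = 5$. The answer is $\\boxed{5}$."),
]
demos = "".join(
    f"Solve the following problem step by step. Question: {q}\nSolution: {s}\n\n"
    for q, s in few_shot
)
few_shot_prompt = demos + f"Solve the following problem step by step. Question: {example}\nSolution:"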

Acknowledgements

Thanks to all contributors who have helped in developing this model.
