
Lingzhi Large Language Model - A Vertical-Domain Industry Expert

🌐 Official website, welcome to visit

✨ Highlights

  • Faithfully reproduces Qwen2-chat from Qwen2-base and releases the training data;
  • In vertical-domain training scenarios, the Lingzhi models improve domain-specific performance while preserving general-domain performance;
  • Summarizes eight continual-training paradigms (e.g., direct instruction fine-tuning, or continual pre-training followed by instruction fine-tuning) and adopts the best paradigm for each model size;
  • Open-sources 8 Lingzhi models: Lingzhi-0.5B-chat, Lingzhi-0.8B-chat, Lingzhi-1.5B-chat, Lingzhi-2.7B-chat, Lingzhi-7B-chat, Lingzhi-10B-chat, Lingzhi-57MOE14B-chat, Lingzhi-72B-chat.

📄 Abstract

In real-world applications, continual training is common when pre-training data are unavailable. However, continual training often causes large language models (LLMs) to catastrophically forget their general capabilities while enhancing domain-specific skills. In this paper, we first conduct an empirical study of common continual-training paradigms and then select the best one to train the Lingzhi model series. Experiments show that Lingzhi strengthens domain-specific performance while preserving general capabilities. We have open-sourced all models, training data, and benchmarks, so users can apply them to their own specific domains.

📘 Introduction

Large language models (LLMs) have attracted considerable attention in recent years for their strong performance on a wide range of real-world downstream tasks. In practice, although existing LLMs perform well in general domains, they may underperform in the specific domains users care about (such as accounting, law, and finance), because they lack exposure to domain-specific expertise during pre-training or instruction fine-tuning.

To improve an LLM's performance in a specific domain, we need to collect the corresponding data for continual learning, such as continual pre-training (CPT) or supervised fine-tuning (SFT). However, we observe that continual learning on a specific domain alone can cause catastrophic forgetting of general capabilities such as planning, instruction following, mathematics, coding, and natural language understanding.

To retain both general and domain-specific capabilities, a common practice is to deploy an unmodified original model for general tasks alongside a fine-tuned model for specialized tasks. This places a heavy demand on compute resources such as GPUs and memory, which hinders commercial deployment, and it is widely recognized as a thorny problem in industry. A question worth studying therefore arises: how can we improve domain-specific performance during continual learning without compromising general capabilities?

To address this question, we conducted an empirical study that explored various continual-learning paradigms and summarized their strengths and weaknesses. Based on this study, we chose the best learning paradigm and training data and continued training from Qwen2-base, yielding our Lingzhi model series. Extensive experiments show that Lingzhi performs strongly across multiple specific domains while matching the original Qwen2-chat models in general capabilities.
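To make the continual-learning setup above concrete, the sketch below shows what a minimal domain SFT run on top of a base checkpoint might look like with Hugging Face Transformers. It is purely illustrative and is not the Lingzhi training recipe: the dataset path, its instruction/response fields, and the hyperparameters are hypothetical placeholders.

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Hypothetical paths and hyperparameters; not the official Lingzhi configuration.
base_model_path = "Qwen/Qwen2-0.5B"         # base checkpoint to continue training from
domain_data_path = "data/domain_sft.jsonl"  # domain instruction data (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base_model_path)
model = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype="auto")

dataset = load_dataset("json", data_files=domain_data_path, split="train")

def tokenize(example):
    # Concatenate instruction and response into one causal-LM training sequence.
    text = example["instruction"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lingzhi-domain-sft",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=1e-5,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()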

📋 Examples

  1. Hugging Face example code
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

lingzhi_model_path = "Lingzhi-AI/Lingzhi-7B-chat"

# Load the chat model and tokenizer; device_map="auto" places weights on available devices.
model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
# Format the conversation with the chat template and tokenize it.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# Generate a response, then strip the prompt tokens from the output.
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
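
The snippet above relies on the generation settings shipped with the model. If you want to control decoding explicitly, you can pass standard sampling arguments to generate; the values below are only illustrative, not officially recommended Lingzhi settings.

# Optional: sampling-based decoding (illustrative values, not official defaults).
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=0.7,
    top_p=0.8,
    repetition_penalty=1.05,
)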
  2. ModelScope example code
from modelscope import AutoModelForCausalLM, AutoTokenizer
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

lingzhi_model_path = "LingzhiLLM/Lingzhi-7B-chat"

model = AutoModelForCausalLM.from_pretrained(
    lingzhi_model_path,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(lingzhi_model_path)

prompt = "帮我介绍一下灵智大模型。"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

📊 Results

Note: All Qwen2 baseline results were evaluated in our unified evaluation environment.

Benchmark groups: English (MMLU, BBH), Chinese (C-Eval, CMMLU), Math (GSM8K, MathQA), and Code (HumanEval, MBPP) make up the General benchmarks; Account and Law are the Domain benchmarks; Avg. is the average over the ten benchmarks.

| Model | MMLU | BBH | C-Eval | CMMLU | GSM8K | MathQA | HumanEval | MBPP | Account | Law | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | | | | | | | | | | |
| Qwen2-0.5B-chat | 43.30 | 10.35 | 54.16 | 53.57 | 33.97 | 25.76 | 20.73 | 12.40 | 17.01 | 25.00 | 29.62 |
| Qwen2-1.5B-chat | 55.73 | 9.55 | 69.32 | 70.13 | 54.21 | 32.93 | 42.68 | 20.60 | 32.65 | 42.07 | 42.99 |
| Qwen2-7B-chat | 69.82 | 30.56 | 81.58 | 81.77 | 66.26 | 44.09 | 72.56 | 42.20 | 55.10 | 59.15 | 60.31 |
| Qwen2-57MOE14B-chat | - | - | - | - | - | - | - | - | - | - | - |
| Qwen2-72B-chat | - | - | - | - | - | - | - | - | - | - | - |
| Lingzhi Models | | | | | | | | | | | |
| Lingzhi-0.5B-chat | 44.25 | 25.65 | 55.05 | 53.74 | 29.34 | 29.18 | 25.00 | 22.40 | 25.85 | 40.24 | 35.07 |
| Lingzhi-0.8B-chat | 42.93 | 27.77 | 53.34 | 50.98 | 21.00 | 28.84 | 28.66 | 18.00 | 24.49 | 40.85 | 33.69 |
| Lingzhi-1.5B-chat | 55.35 | 33.67 | 69.47 | 69.10 | 49.58 | 35.31 | 39.02 | 31.00 | 37.41 | 42.68 | 46.26 |
| Lingzhi-2.7B-chat | 53.65 | 36.77 | 67.09 | 67.39 | 46.02 | 34.51 | 40.85 | 30.00 | 38.10 | 60.98 | 47.54 |
| Lingzhi-7B-chat | 69.06 | 58.95 | 82.69 | 83.05 | 74.22 | 45.59 | 56.10 | 49.80 | 72.79 | 89.02 | 68.13 |
| Lingzhi-10B-chat | 69.37 | 64.37 | 81.50 | 82.27 | 76.19 | 46.00 | 60.98 | 50.40 | 70.07 | 82.93 | 68.41 |
| Lingzhi-57MOE14B-chat | - | - | - | - | - | - | - | - | - | - | - |
| Lingzhi-72B-chat | - | - | - | - | - | - | - | - | - | - | - |

📚 Citation

⚠️ Warning: If you use our models or data, please cite the following reference.

@misc{lingzhi,
      title={Lingzhi: Improving Domain-Specific Performance without Compromising General Capabilities}, 
      author={Daoguang Zan and Lei Yu and Ailun Yu and Zhirong Huang and Zongshuai Ruan and Pengjie Huang},
      year={2024},
      note={All authors contributed equally. The computational power required to train the Lingzhi models (12*8 H800 80G) was provided by Lingzhi AI. Special thanks to them.}
}