BlueLM

🖥 github • 📜 LICENSE • 🎯 vivo Developers • 🗨 WeChat

Introduction

BlueLM is a large-scale pre-trained language model developed independently by the vivo AI Global Research Institute (vivo AI Lab). This release includes a 7B base model and a 7B chat model, each available with a 2K or a 32K context length.

High-quality data: BlueLM is trained on a high-quality corpus of 2.6 trillion tokens. The corpus consists mainly of Chinese and English data, with a small amount of Japanese and Korean data.
Stronger performance: BlueLM-7B-Chat achieves leading results on the C-Eval and CMMLU benchmarks and is highly competitive among open-source models of the same size.
Longer context: BlueLM-7B-Base-32K and BlueLM-7B-Chat-32K extend the context length from 2K to 32K. They support longer-context understanding while keeping basic capabilities comparable.
Model license: BlueLM weights are open for academic research and commercial use.
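A long context window has to hold both the prompt and the tokens generated after it. The sketch below shows one way to check that budget before calling the model. It assumes that "32K" means 32 * 1024 = 32,768 token positions, which the text above does not state explicitly, and the helper names `fits_context` and `max_prompt_tokens` are hypothetical.

```python
# Assumption: "32K" is read as 32 * 1024 token positions.
CONTEXT_TOKENS_32K = 32 * 1024


def fits_context(prompt_tokens, max_new_tokens, context_tokens=CONTEXT_TOKENS_32K):
    """Return True if the prompt plus the requested new tokens fit in the window."""
    if prompt_tokens < 0 or max_new_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens + max_new_tokens <= context_tokens


def max_prompt_tokens(max_new_tokens, context_tokens=CONTEXT_TOKENS_32K):
    """Return the largest prompt length that still leaves room for generation."""
    return max(context_tokens - max_new_tokens, 0)


print(fits_context(32000, 64))   # prompt of 32,000 tokens, 64 new tokens
print(max_prompt_tokens(64))
```

With a Hugging Face tokenizer called with `return_tensors="pt"`, the prompt length is available as `inputs["input_ids"].shape[1]`.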

The released models and their Hugging Face download links are listed in the table below:

Version	Base Model	Chat Model	4-bit Quantized Chat Model
7B-2K	BlueLM-7B-Base	BlueLM-7B-Chat	BlueLM-7B-Chat-4bits
7B-32K	BlueLM-7B-Base-32K	BlueLM-7B-Chat-32K	-
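When switching between variants in scripts, it can help to look up the repository ID from the table instead of typing it by hand. The following is a minimal sketch: the model names are copied from the table, while the `vivo-ai/` prefix for every variant is an assumption based on the one repository ID that appears in the inference example. The `repo_id` helper is hypothetical.

```python
# Model names copied from the release table above.
RELEASED_MODELS = {
    ("2K", "base"): "BlueLM-7B-Base",
    ("2K", "chat"): "BlueLM-7B-Chat",
    ("2K", "chat-4bits"): "BlueLM-7B-Chat-4bits",
    ("32K", "base"): "BlueLM-7B-Base-32K",
    ("32K", "chat"): "BlueLM-7B-Chat-32K",
}


def repo_id(context, kind, org="vivo-ai"):
    """Build a Hugging Face repo ID for a released variant.

    Assumption: every variant lives under the same organization prefix.
    """
    key = (context.upper(), kind.lower())
    if key not in RELEASED_MODELS:
        raise KeyError(f"no released model for context={context!r}, kind={kind!r}")
    return f"{org}/{RELEASED_MODELS[key]}"


print(repo_id("32k", "base"))
```

The lookup raises `KeyError` for combinations the table marks with `-`, such as a 4-bit quantized 32K chat model.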

Benchmark Results

We evaluated BlueLM-7B-Chat-32K on the LongBench benchmark; the results are shown in the table below:

Model	Average	Summary	Single-Doc QA	Multi-Doc QA	Code	Few-shot	Synthetic
BlueLM-7B-Chat-32K	41.2	18.8	35.6	36.2	54.2	56.9	45.5
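As a quick sanity check on the table, the Average column is consistent with the unweighted mean of the six category scores. The snippet below only verifies that arithmetic on the reported numbers; it is an observation about this row, not a statement of how LongBench defines its average.

```python
# Category scores copied from the table row for BlueLM-7B-Chat-32K.
scores = {
    "Summary": 18.8,
    "Single-Doc QA": 35.6,
    "Multi-Doc QA": 36.2,
    "Code": 54.2,
    "Few-shot": 56.9,
    "Synthetic": 45.5,
}

# Unweighted mean over the six categories, rounded to one decimal place.
average = round(sum(scores.values()) / len(scores), 1)
print(average)
```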

Inference and Deployment

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> # Load the tokenizer and the model (bfloat16 weights on the first GPU)
>>> tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Base-32K", trust_remote_code=True, use_fast=False)
>>> model = AutoModelForCausalLM.from_pretrained("vivo-ai/BlueLM-7B-Base-32K", device_map="cuda:0", torch_dtype=torch.bfloat16, trust_remote_code=True)
>>> model = model.eval()
>>> # Few-shot prompt: two "title->author" pairs followed by an open query
>>> inputs = tokenizer("儒林外史->吴敬梓\n隋唐演义->褚人获\n红楼梦->", return_tensors="pt")
>>> inputs = inputs.to("cuda:0")
>>> pred = model.generate(**inputs, max_new_tokens=64, repetition_penalty=1.1)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))
儒林外史->吴敬梓
隋唐演义->褚人获
红楼梦->曹雪芹
三国演义->罗贯中
水浒传->施耐庵
西游记->吴承恩
聊斋志异->蒲松龄
金瓶梅->兰陵笑笑生
封神演义->许仲琳
三言二拍->冯梦龙
东周列国志->冯梦龙
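The prompt in this example follows a few-shot pattern: solved `title->author` pairs, one per line, followed by an open query. The base model then keeps generating further pairs, so only the first generated line answers the query. The sketch below builds such a prompt and extracts that first line. The helper names `build_prompt` and `first_answer` are hypothetical, and the code assumes the decoded output starts with the prompt verbatim, as it does in the output shown here.

```python
def build_prompt(examples, query, arrow="->"):
    """Join solved pairs and an open query into one few-shot prompt."""
    lines = [f"{left}{arrow}{right}" for left, right in examples]
    lines.append(f"{query}{arrow}")
    return "\n".join(lines)


def first_answer(generated, prompt):
    """Return the first generated line after the prompt.

    Assumption: the decoded text begins with the prompt unchanged.
    """
    if not generated.startswith(prompt):
        raise ValueError("generated text does not start with the prompt")
    completion = generated[len(prompt):]
    return completion.split("\n", 1)[0].strip()


examples = [("儒林外史", "吴敬梓"), ("隋唐演义", "褚人获")]
prompt = build_prompt(examples, "红楼梦")

# Stand-in for the decoded model output, shortened from the example run.
generated = prompt + "曹雪芹\n三国演义->罗贯中\n水浒传->施耐庵"
print(first_answer(generated, prompt))
```

Keeping `max_new_tokens` small, or stopping at the first newline, avoids paying for the extra pairs when only one answer is needed.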

For more usage instructions, please refer to our GitHub repository.

License

To make this project more open and flexible and to serve more developers and users, the open-source license for this project's large model changed on December 25, 2024, from the Community License for BlueLM Model to the OpenAtom Model License.

Under the new license, users can use, modify, and distribute this project's large model with fewer restrictions. Please make sure you read and understand the new license. We welcome feedback on this change; you can contact us by email ([email protected]).

Thank you for your support of this project!