BlueLM

🖥 github • 📜 LICENSE • 🎯 vivo Developers • 🗨 WeChat

Introduction

BlueLM is a large-scale pre-trained language model developed in-house by the vivo AI Global Research Institute (vivo AI Lab). This release includes a 7B base model and a 7B chat model, each available in a 2K-context version and a 32K long-context version.

High-quality data: BlueLM is trained on a high-quality corpus of 2.6 trillion tokens. The corpus consists mainly of Chinese and English data, with a small amount of Japanese and Korean data.
Stronger performance: BlueLM-7B-Chat achieves leading results on the C-Eval and CMMLU benchmarks and is highly competitive among open-source models of the same size.
Longer context: BlueLM-7B-Base-32K and BlueLM-7B-Chat-32K extend the context length from 2K to 32K. They support longer-context understanding while keeping basic capabilities at a comparable level.
Model license: BlueLM weights are open for academic research and commercial use.
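To make the 2K versus 32K distinction concrete, the sketch below estimates how many prompt tokens remain once room is reserved for the generated reply. It assumes that "2K" and "32K" mean 2048 and 32768 token positions, which this README does not state explicitly; `CONTEXT_LENGTHS` and `max_prompt_tokens` are hypothetical names used only for illustration.

```python
# Assumption: "2K" = 2 * 1024 tokens and "32K" = 32 * 1024 tokens.
CONTEXT_LENGTHS = {"2K": 2 * 1024, "32K": 32 * 1024}


def max_prompt_tokens(context: str, max_new_tokens: int) -> int:
    """Return the prompt-token budget left after reserving max_new_tokens for the reply."""
    window = CONTEXT_LENGTHS[context]
    if not 0 <= max_new_tokens <= window:
        raise ValueError("max_new_tokens must be between 0 and the context length")
    return window - max_new_tokens


# Reserving 2048 new tokens in a 32K window leaves 30720 tokens for the prompt.
print(max_prompt_tokens("32K", 2048))  # 30720
```

Under the same assumption, a 2K model with 2048 tokens reserved for the reply would have no room left for the prompt, which is where the 32K versions become useful.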

The released versions and their Hugging Face download links are listed in the table below:

Version	Base Model	Chat Model	4-bit Quantized Chat Model
7B-2K	BlueLM-7B-Base	BlueLM-7B-Chat	BlueLM-7B-Chat-4bits
7B-32K	BlueLM-7B-Base-32K	BlueLM-7B-Chat-32K	BlueLM-7B-Chat-32K-AWQ / BlueLM-7B-Chat-32K-GPTQ
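As a small convenience, the table can be turned into a lookup from context length and variant to a Hugging Face repository ID. This is only a sketch: the `vivo-ai/` prefix is taken from the inference example in this README, and applying it to the other model names is an assumption. `MODELS` and `repo_id` are hypothetical names.

```python
HF_ORG = "vivo-ai"  # prefix seen in the inference example; assumed to apply to every model

# Model names copied from the table, keyed by (context length, variant).
MODELS = {
    ("2K", "base"): "BlueLM-7B-Base",
    ("2K", "chat"): "BlueLM-7B-Chat",
    ("2K", "chat-4bits"): "BlueLM-7B-Chat-4bits",
    ("32K", "base"): "BlueLM-7B-Base-32K",
    ("32K", "chat"): "BlueLM-7B-Chat-32K",
    ("32K", "chat-awq"): "BlueLM-7B-Chat-32K-AWQ",
    ("32K", "chat-gptq"): "BlueLM-7B-Chat-32K-GPTQ",
}


def repo_id(context: str, variant: str) -> str:
    """Build an '<org>/<model>' repository ID; raises KeyError for unknown combinations."""
    return f"{HF_ORG}/{MODELS[(context, variant)]}"


print(repo_id("32K", "chat-gptq"))  # vivo-ai/BlueLM-7B-Chat-32K-GPTQ
```

The resulting string could then be passed to `AutoTokenizer.from_pretrained` and `AutoModelForCausalLM.from_pretrained` in place of a hard-coded name.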

Benchmark Results

We evaluated BlueLM-7B-Chat-32K on the LongBench benchmark. The results are shown in the table below:

Model	Average	Summary	Single-Doc QA	Multi-Doc QA	Code	Few-shot	Synthetic
BlueLM-7B-Chat-32K	41.2	18.8	35.6	36.2	54.2	56.9	45.5
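The Average column appears to be consistent with an unweighted mean of the six category scores. The README does not say how the average was computed, so the check below is only a sanity check under that assumption.

```python
# Category scores copied from the LongBench table for BlueLM-7B-Chat-32K.
scores = {
    "Summary": 18.8,
    "Single-Doc QA": 35.6,
    "Multi-Doc QA": 36.2,
    "Code": 54.2,
    "Few-shot": 56.9,
    "Synthetic": 45.5,
}

# Assumption: the reported average is the plain mean over the six categories.
average = sum(scores.values()) / len(scores)
print(round(average, 1))  # 41.2
```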

Inference and Deployment

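>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> # Load the tokenizer and the GPTQ-quantized 32K chat model; the repository ships custom code, hence trust_remote_code=True.
>>> tokenizer = AutoTokenizer.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K-GPTQ", trust_remote_code=True, use_fast=False)
>>> model = AutoModelForCausalLM.from_pretrained("vivo-ai/BlueLM-7B-Chat-32K-GPTQ", device_map="cuda:0", torch_dtype=torch.float16, trust_remote_code=True, low_cpu_mem_usage=True, use_cache=False)
>>> model = model.eval()
>>> # Prompt format: "[|Human|]:<question>[|AI|]:". The question asks for a review of about 1000 characters of Liu Cixin's novel "The Three-Body Problem".
>>> inputs = tokenizer("[|Human|]:写一篇关于刘慈欣《三体》小说的读后感，1000字左右[|AI|]:", return_tensors="pt")
>>> inputs = inputs.to("cuda:0")
>>> # Generate up to 2048 new tokens with a mild repetition penalty, then decode the full sequence (prompt included).
>>> pred = model.generate(**inputs, max_new_tokens=2048, repetition_penalty=1.1)
>>> print(tokenizer.decode(pred.cpu()[0], skip_special_tokens=True))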
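The prompt in the example wraps the user's question between a `[|Human|]:` marker and an `[|AI|]:` marker. A minimal sketch of a helper that builds such a single-turn prompt is shown below. `build_prompt` is a hypothetical name, not part of the BlueLM code, and the multi-turn format is not covered here because this README does not describe it.

```python
HUMAN_TAG = "[|Human|]:"  # marker preceding the user's question, as in the example
AI_TAG = "[|AI|]:"        # marker after which the model writes its reply


def build_prompt(question: str) -> str:
    """Wrap a single user question in the chat markers used by the example."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"{HUMAN_TAG}{question}{AI_TAG}"


print(build_prompt("What is BlueLM?"))  # [|Human|]:What is BlueLM?[|AI|]:
```

The returned string can be passed to the tokenizer exactly as the hard-coded prompt is in the example.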

For more usage instructions, please refer to our GitHub repository.

License

Our code is released under the Apache-2.0 license. Use of the BlueLM model weights must comply with the Community License for BlueLM Model (vivo_BlueLM模型许可协议).