JoeyHeisenberg committed
Commit
9c4f598
1 Parent(s): 57c05a0

Update README.md

Files changed (1)
  1. README.md +19 -2
README.md CHANGED
@@ -22,7 +22,7 @@ BlueLM 是由 vivo AI 全球研究院自主研发的大规模预训练语言模
  BlueLM is a large-scale open-source language model independently developed by the vivo AI Lab. This release includes 2K and 32K context length versions for both Base and Chat models.
 
  - **High-quality Data**: BlueLM is trained on high-quality data comprising 2.6 trillion tokens. The training corpus contains Chinese, English, Japanese, and Korean data.
- - **Stronger Performance**: BlueLM-7B-Chat achieves the best performance in C-Eval and CMMLU benchmarks of the same size.
+ - **Stronger Performance**: BlueLM-7B-Chat achieves strong, competitive performance on the C-Eval and CMMLU benchmarks among models of the same size.
  - **Longer Context**: We have extended the context length of both the BlueLM-7B-Base-32K and BlueLM-7B-Chat-32K models from 2K to 32K. These models support longer-context understanding while maintaining the same basic capabilities.
  - **Model License**: BlueLM weights are open for academic research and commercial use.
@@ -32,9 +32,26 @@ The release versions and hugging face download links are listed in the table bel
 
  | | Base Model | Chat Model | 4bits Quantized Chat Model |
  |:---:|:--------------------:|:--------------------:|:--------------------------:|
- | 7B | [BlueLM-7B-Base](https://huggingface.co/vivo-ai/BlueLM-7B-Base) | [BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat) | [BlueLM-7B-Chat-4bits](https://huggingface.co/vivo-ai/BlueLM-7B-Chat-4bits) |
+ | 7B-2K | [BlueLM-7B-Base](https://huggingface.co/vivo-ai/BlueLM-7B-Base) | [BlueLM-7B-Chat](https://huggingface.co/vivo-ai/BlueLM-7B-Chat) | [BlueLM-7B-Chat-4bits](https://huggingface.co/vivo-ai/BlueLM-7B-Chat-4bits) |
  | 7B-32K | [BlueLM-7B-Base-32K](https://huggingface.co/vivo-ai/BlueLM-7B-Base-32K) | [BlueLM-7B-Chat-32K](https://huggingface.co/vivo-ai/BlueLM-7B-Chat-32K) | - |
 
+ ## 评测结果/Benchmark Results
+
+ 为了保证模型评测的一致性,我们采用 [OpenCompass](https://opencompass.org.cn/leaderboard-llm) 进行相关榜单的评测。我们分别在 C-Eval、MMLU、CMMLU、GaoKao、AGIEval、BBH、GSM8K、MATH 和 HumanEval 榜单对 BlueLM 的通用能力、数学能力和代码能力进行了测试。
+
+ To ensure consistent model evaluation, we use [OpenCompass](https://opencompass.org.cn/leaderboard-llm) to produce all leaderboard results. We evaluated BlueLM on the C-Eval, MMLU, CMMLU, GaoKao, AGIEval, BBH, GSM8K, MATH, and HumanEval benchmarks, covering general, mathematical, and coding ability.
+
+ | Model | **C-Eval** | **MMLU** | **CMMLU** | **GaoKao** | **AGIEval** | **BBH** | **GSM8K** | **MATH** | **HumanEval** |
+ |:------------------|:-----------|:---------|:----------|:-----------|:------------|:--------|:----------|:---------|:--------------|
+ | | 5-shot | 5-shot | 5-shot | 0-shot | 0-shot | 3-shot | 4-shot | 5-shot | 0-shot |
+ | GPT-4 | 69.9 | 86.4 | 71.2 | 72.3 | 55.1 | 86.7 | 91.4 | 45.8 | 74.4 |
+ | ChatGPT | 52.5 | 70.0 | 53.9 | 51.1 | 39.9 | 70.1 | 78.2 | 28.0 | 73.2 |
+ | LLaMA2-7B | 32.5 | 45.3 | 31.8 | 18.9 | 21.8 | 38.2 | 16.7 | 3.3 | 12.8 |
+ | ChatGLM2-6B(Base) | 51.7 | 47.9 | 50.0 | - | - | 33.7 | 32.4 | 6.5 | - |
+ | Baichuan2-7B | 56.3 | 54.7 | 57.0 | 34.8 | 34.6 | 41.8 | 24.6 | 5.4 | 17.7 |
+ | BlueLM-7B-Base | 67.5 | 55.2 | 66.6 | 58.9 | 43.4 | 41.7 | 27.2 | 6.2 | 18.3 |
+ | BlueLM-7B-Chat | 72.7 | 50.7 | 74.2 | 48.7 | 43.4 | 65.6 | 51.9 | 13.4 | 21.3 |
+
  ## 推理部署/Inference and Deployment
 
  ```python
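# The repository's example code is truncated in this diff view. The sketch below
# is a minimal, unofficial reconstruction of typical BlueLM-7B-Chat usage with
# the Hugging Face `transformers` library; the `[|Human|]:`/`[|AI|]:` prompt
# markers and the generation settings are assumptions, so verify them against
# the model card before relying on this.

def build_bluelm_prompt(user_message: str) -> str:
    """Wrap a user message in the assumed BlueLM chat prompt format."""
    return f"[|Human|]:{user_message}[|AI|]:"

def chat(user_message: str, model_id: str = "vivo-ai/BlueLM-7B-Chat") -> str:
    """Generate a reply; downloads the model weights on first use."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True, use_fast=False)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", trust_remote_code=True)
    model.eval()

    inputs = tokenizer(build_bluelm_prompt(user_message), return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256)
    # Strip the prompt tokens so only the model's reply is returned.
    reply_ids = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True)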