---
language:
- zh
- en
base_model: openbmb/MiniCPM-1B-sft-bf16
pipeline_tag: text-classification
tags:
- sentence-transformers
library_name: transformers
---
## MiniCPM-Reranker-Light

**MiniCPM-Reranker-Light** is a bilingual and cross-lingual text re-ranking model developed by ModelBest Inc., THUNLP (the Natural Language Processing Lab at Tsinghua University), and NEUIR (the Information Retrieval Group at Northeastern University), featuring:

- Exceptional Chinese and English re-ranking capabilities.
- Outstanding cross-lingual re-ranking capabilities between Chinese and English.
- Long-text support (up to 8192 tokens).

MiniCPM-Reranker-Light is trained from [MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16) and uses bidirectional attention in its architecture. The model underwent multi-stage training on approximately 5 million training examples, including open-source, synthetic, and proprietary data.

We also invite you to explore the UltraRAG series:

- Retrieval Model: [MiniCPM-Embedding-Light](https://huggingface.co/openbmb/MiniCPM-Embedding-Light)
- Re-ranking Model: [MiniCPM-Reranker-Light](https://huggingface.co/openbmb/MiniCPM-Reranker-Light)
- Domain Adaptive RAG Framework: [UltraRAG](https://github.com/openbmb/UltraRAG)


## Model Information

- Model Size: 1.2B
- Max Input Tokens: 8192

## Usage

### Input Format

MiniCPM-Reranker-Light supports instructions in the following format:

```
<s>Instruction: {{ instruction }} Query: {{ query }}</s>{{ document }}
```

For example:

```
<s>Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?</s>(文档省略)
```

```
<s>Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.</s>(document omitted)
```

MiniCPM-Reranker-Light also works without an instruction, using the following format:

```
<s>Query: {{ query }}</s>{{ document }}
```
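The two formats above can be sketched as a small helper that builds the query-side text of a (query, document) pair. This is only an illustration: `build_query_text` is a hypothetical name, not part of the model's API, and it assumes the tokenizer adds `<s>` and `</s>` itself, as the demo scripts in this card pass plain `"Query: ..."` strings.

```python
from typing import Optional


def build_query_text(query: str, instruction: Optional[str] = None) -> str:
    """Build the query-side text of a (query, document) pair.

    Assumption: special tokens (<s>, </s>) are added by the tokenizer,
    so only the "Instruction: ... Query: ..." part is assembled here.
    """
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    # Instruction-free mode: only the "Query:" prefix is used.
    return f"Query: {query}"


# Pair the formatted query with each candidate document.
pairs = [
    [build_query_text("Where is the capital of China?"), doc]
    for doc in ["beijing", "shanghai"]
]
print(pairs[0])
```

The resulting pairs have the same shape as the `sentence_pairs` lists used in the demo scripts.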

When evaluating on BEIR and C-MTEB/Retrieval, we use the instructions in `instructions.json`. Other evaluations use no instructions.

### Requirements

```
transformers==4.37.2
```

### Demo

#### Huggingface Transformers

```python
from transformers import AutoModelForSequenceClassification
import torch

model_name = "OpenBMB/MiniCPM-Reranker-Light"
model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# You can also use the following code to use flash_attention_2
# model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True,attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()

query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"] # 北京,上海

rerank_score = model.rerank(query, passages,query_instruction="Query:", batch_size=32, max_length=1024)
print(rerank_score) #[0.01791382 0.00024533]


sentence_pairs = [[f"Query: {query}", doc] for doc in passages]
scores = model.compute_score(sentence_pairs, batch_size=32, max_length=1024)
print(scores) #[0.01791382 0.00024533]
```
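The scores come back in the same order as the input passages, so a final ranking still needs a sort. A minimal sketch of that step follows; `rank_passages` is a hypothetical helper, not part of the model's API, and the example scores are the ones printed in the demo above.

```python
from typing import List, Optional, Sequence, Tuple


def rank_passages(
    passages: Sequence[str],
    scores: Sequence[float],
    top_k: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted by descending score."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    # float() also accepts NumPy scalars, so array outputs work too.
    ranked = sorted(
        ((p, float(s)) for p, s in zip(passages, scores)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked if top_k is None else ranked[:top_k]


# Scores copied from the demo output above, passages deliberately shuffled.
for passage, score in rank_passages(["shanghai", "beijing"], [0.00024533, 0.01791382]):
    print(f"{score:.4f}  {passage}")
```

Passing `top_k` keeps only the best candidates, which is the usual way to hand a short list to a downstream generator.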

#### Sentence Transformer

```python
from sentence_transformers import CrossEncoder
import torch

model_name = "OpenBMB/MiniCPM-Reranker-Light"
model = CrossEncoder(model_name,max_length=1024,trust_remote_code=True, automodel_args={"torch_dtype": torch.float16})
# You can also use the following code to use flash_attention_2
#model = CrossEncoder(model_name,max_length=1024,trust_remote_code=True, automodel_args={"attn_implementation":"flash_attention_2","torch_dtype": torch.float16})
model.tokenizer.padding_side = "right"

query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"] # 北京,上海

INSTRUCTION = "Query: "
query = INSTRUCTION + query

sentence_pairs = [[query, doc] for doc in passages]

scores = model.predict(sentence_pairs, convert_to_tensor=True).tolist()
rankings = model.rank(query, passages, return_documents=True, convert_to_tensor=True)

print(scores) # [0.017913818359375, 0.0002453327178955078]
for ranking in rankings:
    print(f"Score: {ranking['score']:.4f}, Corpus: {ranking['text']}")
  
# Score: 0.0179, Corpus: beijing
# Score: 0.0002, Corpus: shanghai
```

#### Infinity

```python
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "中国的首都是哪里?" # "What is the capital of China?"
docs = ["beijing", "shanghai"] # "北京", "上海"

INSTRUCTION = "Query:"
query = f"{INSTRUCTION} {query}"

array = AsyncEngineArray.from_args(
  [EngineArgs(model_name_or_path = "OpenBMB/MiniCPM-Reranker-Light", engine="torch", dtype="float16", bettertransformer=False, trust_remote_code=True, model_warmup=False)]
)

async def rerank(engine: AsyncEmbeddingEngine): 
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))

asyncio.run(rerank(array[0])) # [(RerankReturnType(relevance_score=0.017917344, document='beijing', index=0), 'beijing'), (RerankReturnType(relevance_score=0.00024729347, document='shanghai', index=1), 'shanghai')]
```

#### FlagEmbedding

```python
from FlagEmbedding import FlagReranker
model_name = "OpenBMB/MiniCPM-Reranker-Light"
model = FlagReranker(model_name, use_fp16=True, query_instruction_for_rerank="Query: ", trust_remote_code=True)
# You can hack the __init__() method of the FlagEmbedding BaseReranker class to use flash_attention_2 for faster inference
#  self.model = AutoModelForSequenceClassification.from_pretrained(
#             model_name_or_path,
#             trust_remote_code=trust_remote_code,
#             cache_dir=cache_dir,
#             # torch_dtype=torch.float16, # we need to add this line to use fp16
#             # attn_implementation="flash_attention_2", # we need to add this line to use flash_attention_2
#         )
model.tokenizer.padding_side = "right"

query = "中国的首都是哪里?" # "Where is the capital of China?"
passages = ["beijing", "shanghai"] # 北京,上海

sentence_pairs = [[query, doc] for doc in passages]

scores = model.compute_score(sentence_pairs,normalize=True)
print(scores) # [0.01791734476747132, 0.0002472934613244585]
```

## Evaluation Results

### CN/EN Re-ranking Results

We re-rank the top-100 documents retrieved by `bge-large-zh-v1.5` on C-MTEB/Retrieval (Chinese) and by `bge-large-en-v1.5` on BEIR (English).


| Model                                     | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|-------------------------------------------|----------------------------|----------------|
| bge-large-zh-v1.5 (retriever for Chinese) | 70.46                      | -              |
| bge-large-en-v1.5 (retriever for English) | -                          | 54.29          |
| bge-reranker-v2-m3                        | 71.82                      | 55.36          |
| bge-reranker-v2-minicpm-28                | 73.51                      | 59.86          |
| bge-reranker-v2-gemma                     | 71.74                      | 60.71          |
| bge-reranker-v2.5-gemma2                  | -                          | 63.67          |
| MiniCPM-Reranker                          | 76.79                      | 61.32          |
| MiniCPM-Reranker-Light                    | 76.19                      | 61.34          |
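
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It assumes linear gain and a log2 rank discount; the evaluation toolkits behind the numbers above may differ in details, and the function names are illustrative only. The table values appear to be on a 0-100 scale, whereas this sketch returns a value between 0 and 1.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(
    ranked_relevances: Sequence[float],
    all_relevances: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """NDCG@k: DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_relevances: labels of the returned documents, in ranked order.
    all_relevances: labels of every judged document for the query; if
    omitted, the ideal ranking is built from the returned documents only.
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    idcg = dcg_at_k(sorted(pool, reverse=True), k)
    if idcg == 0:
        return 0.0  # no relevant documents for this query
    return dcg_at_k(ranked_relevances, k) / idcg


# Toy example: re-ranking moves both relevant documents to the top.
before = [0, 1, 0, 1]
after = [1, 1, 0, 0]
print(ndcg_at_k(before), ndcg_at_k(after))
```

A benchmark score is typically the mean of this per-query value over all queries in the dataset.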

### CN-EN Cross-lingual Re-ranking Results

We re-rank the top-100 documents retrieved by `bge-m3` (Dense).

| Model                              | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|------------------------------------|---------------------------|---------------------|---------------------|
| bge-m3 (Dense) (retriever)         | 66.4                      | 30.49               | 41.09               |
| jina-reranker-v2-base-multilingual | 69.33                     | 36.66               | 50.03               |
| bge-reranker-v2-m3                 | 69.75                     | 40.98               | 49.67               |
| gte-multilingual-reranker-base     | 68.51                     | 38.74               | 45.3                |
| MiniCPM-Reranker                   | 71.73                     | 43.65               | 50.59               |
| MiniCPM-Reranker-Light             | 71.34                     | 46.04               | 51.86               |

## License

- The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
- Use of the MiniCPM-Reranker-Light model weights must strictly follow the [MiniCPM Model License](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
- The MiniCPM-Reranker-Light models and weights are completely free for academic research. After filling out a [questionnaire](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, the weights are also available for free commercial use.