MiniCPM-Embedding-Light
MiniCPM-Embedding-Light is a bilingual and cross-lingual (Chinese-English) text embedding model developed by ModelBest Inc., the Tsinghua University Natural Language Processing Lab (THUNLP), and the Northeastern University Information Retrieval Group (NEUIR). Its main features are:
- Exceptional Chinese and English retrieval capabilities.
- Outstanding cross-lingual retrieval capabilities between Chinese and English.
- Long-text support (up to 8192 tokens).
- Dense vectors and token-level sparse vectors.
- Variable dense vector dimensions (Matryoshka representation [2]).
MiniCPM-Embedding-Light incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.
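As a rough illustration of the pooling step, the sketch below implements position-weighted mean pooling in the style of SGPT [1]: later tokens receive linearly larger weights, and padded positions are masked out. It is written in plain Python for readability. The function name `weighted_mean_pool` and the exact weighting scheme are assumptions for illustration and may differ from the model's own implementation.

```python
from typing import List


def weighted_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Position-weighted mean over token vectors (illustrative sketch).

    hidden: one vector per token, all of the same length.
    mask:   1 for real tokens, 0 for padding.
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    # Token at 1-based position i gets weight i; padded tokens get weight 0.
    weights = [(i + 1) * m for i, m in enumerate(mask)]
    total = sum(weights)
    if total == 0:
        raise ValueError("mask must keep at least one token")
    dim = len(hidden[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, hidden)) / total
        for d in range(dim)
    ]


# Two real tokens followed by one padded position.
pooled = weighted_mean_pool([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
print(pooled)
```

In this toy input the second token carries twice the weight of the first, and the padded third vector has no influence on the result.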
We also invite you to explore the UltraRAG series:
- Retrieval Model: MiniCPM-Embedding-Light
- Re-ranking Model: MiniCPM-Reranker-Light
- Domain Adaptive RAG Framework: UltraRAG
[1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
[2] Kusupati, A., et al. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249.
Model Information
Model Size: 440M
Embedding Dimension: 1024
Max Input Tokens: 8192
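Because the dense vectors follow a Matryoshka representation [2], a 1024-dimensional embedding can in principle be shortened by keeping its leading dimensions and re-normalizing. The sketch below shows that operation in plain Python. `truncate_embedding` is a hypothetical helper, and which smaller dimensions the model was actually trained to support is not stated in this card.

```python
import math
from typing import List


def truncate_embedding(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale to unit L2 norm."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


# Toy 4-dimensional "embedding" shortened to 2 dimensions.
short = truncate_embedding([3.0, 4.0, 1.0, 2.0], 2)
print(short)
```

After re-normalization, the dot product between two truncated vectors is again a cosine similarity, so the same scoring code can be reused at the smaller dimension.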
Usage
Input Format
MiniCPM-Embedding-Light supports query-side instructions in the following format:
Instruction: {{ instruction }} Query: {{ query }}
For example:
Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.
MiniCPM-Embedding-Light also works without an instruction, using the following format:
Query: {{ query }}
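The two formats above can be produced by a small helper. This is a minimal sketch: `format_query` is a hypothetical name, and the single-space separators are an assumption based on the English example. Only queries are formatted this way; the demo scripts in this card encode passages without any prefix.

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query-side input string for either format."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


# Instruction-free format.
print(format_query("What causes throat cancer?"))
# Format with a task instruction.
print(format_query(
    "What causes throat cancer?",
    instruction="Given a medical question, retrieve relevant answers.",
))
```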
Requirements
transformers==4.37.2
Demo
Hugging Face Transformers
from transformers import AutoModel
import torch
model_name = "OpenBMB/MiniCPM-Embedding-Light"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")
# you can use flash_attention_2 for faster inference
# model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")
model.eval()
queries = ["MiniCPM-o 2.6 A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone"]
passages = ["MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming."]
embeddings_query_dense, embeddings_query_sparse = model.encode_query(queries, return_sparse_vectors=True)
embeddings_doc_dense, embeddings_doc_sparse = model.encode_corpus(passages, return_sparse_vectors=True)
dense_scores = (embeddings_query_dense @ embeddings_doc_dense.T)
print(dense_scores.tolist()) # [[0.6512398719787598]]
print(model.compute_sparse_score_dicts(embeddings_query_sparse, embeddings_doc_sparse)) # [[0.27202296]]
dense_scores, sparse_scores, mixed_scores = model.compute_score(queries, passages)
print(dense_scores) # [[0.65123993]]
print(sparse_scores) # [[0.27202296]]
print(mixed_scores) # [[0.73284686]]
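The three scores printed above can be related with a small sketch. It assumes that a sparse vector is a mapping from token to weight and that the sparse score is the sum of weight products over tokens shared by the query and the passage, a common lexical-matching formulation that is not confirmed for this model. The 0.3 weight on the sparse score is inferred from the sample output, where 0.65123993 + 0.3 × 0.27202296 ≈ 0.73284686; it is not taken from the model's documentation. `sparse_dot` and `mix_scores` are hypothetical names.

```python
from typing import Dict


def sparse_dot(query: Dict[str, float], passage: Dict[str, float]) -> float:
    """Sum of weight products over tokens present in both sparse vectors."""
    return sum(w * passage[tok] for tok, w in query.items() if tok in passage)


def mix_scores(dense: float, sparse: float, sparse_weight: float = 0.3) -> float:
    """Combine dense and sparse scores; the default weight is an inferred assumption."""
    return dense + sparse_weight * sparse


# Toy sparse vectors: only "minicpm" and "speech" overlap.
q = {"minicpm": 0.5, "speech": 0.2, "phone": 0.1}
p = {"minicpm": 0.4, "speech": 0.3, "model": 0.6}
toy_sparse = sparse_dot(q, p)
print(toy_sparse)

# Combine the dense and sparse scores from the sample output.
mixed = mix_scores(0.65123993, 0.27202296)
print(mixed)
```

For real use, rely on `model.compute_score`, which returns the dense, sparse, and mixed scores directly.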
Sentence Transformers
import torch
from sentence_transformers import SentenceTransformer
model_name = "openbmb/MiniCPM-Embedding-Light"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})
# you can use flash_attention_2 for faster inference
# model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16})
queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"
INSTRUCTION = "Query: "
embeddings_query = model.encode(queries, prompt=INSTRUCTION)
embeddings_doc = model.encode(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
Infinity
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import numpy as np
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="OpenBMB/MiniCPM-Embedding-Light", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
])
queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"
INSTRUCTION = "Query:"
queries = [f"{INSTRUCTION} {query}" for query in queries]
async def embed_text(engine: AsyncEmbeddingEngine, sentences):
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    return embeddings
queries_embedding = asyncio.run(embed_text(array[0],queries))
passages_embedding = asyncio.run(embed_text(array[0],passages))
scores = (np.array(queries_embedding) @ np.array(passages_embedding).T)
print(scores.tolist()) # [[0.40356746315956116, 0.36183443665504456]]
FlagEmbedding
from FlagEmbedding import FlagModel
model = FlagModel("OpenBMB/MiniCPM-Embedding-Light",
                  query_instruction_for_retrieval="Query: ",
                  pooling_method="mean",
                  trust_remote_code=True,
                  normalize_embeddings=True,
                  use_fp16=True)
# You can hack the __init__() method of the FlagEmbedding BaseEmbedder class to use flash_attention_2 for faster inference
# self.model = AutoModel.from_pretrained(
# model_name_or_path,
# trust_remote_code=trust_remote_code,
# cache_dir=cache_dir,
# # torch_dtype=torch.float16, # we need to add this line to use fp16
# # attn_implementation="flash_attention_2", # we need to add this line to use flash_attention_2
# )
queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"
embeddings_query = model.encode_queries(queries)
embeddings_doc = model.encode_corpus(passages)
scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist()) # [[0.40356746315956116, 0.36183440685272217]]
Evaluation Results
CN/EN Retrieval Results
Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
---|---|---|
bge-large-zh-v1.5 | 70.46 | - |
gte-large-zh | 72.49 | - |
Conan-embedding-v1 | 76.67 | - |
bge-large-en-v1.5 | - | 54.29 |
modernbert-embed-large | - | 54.36 |
snowflake-arctic-embed-l | - | 55.98 |
gte-en-large-v1.5 | - | 57.91 |
me5-large | 63.66 | 51.43 |
bge-m3(Dense) | 65.43 | 48.82 |
gte-multilingual-base(Dense) | 71.95 | 51.08 |
jina-embeddings-v3 | 68.60 | 53.88 |
gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
MiniCPM-Embedding | 76.76 | 58.56 |
MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
MiniCPM-Embedding-Light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |
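For readers unfamiliar with the metric, the sketch below computes NDCG@k under one common formulation (linear gain with a log2 rank discount). It is illustrative only: the function names are hypothetical, the benchmark toolkits used for the table may differ in details such as tie handling, and the table values appear to be scaled by 100.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], all_relevances: List[float], k: int = 10) -> float:
    """NDCG@k: DCG of the system ranking divided by DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# The system ranked an irrelevant passage first, then the two relevant ones.
score = ndcg_at_k([0.0, 1.0, 1.0], all_relevances=[1.0, 1.0])
print(round(score, 4))
```

Here `ranked_relevances` holds the relevance labels of the retrieved passages in ranked order, while `all_relevances` holds the labels of every relevant passage for the query, including any the system failed to retrieve.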
CN-EN Cross-lingual Retrieval Results
Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
---|---|---|---|
me5-large | 44.3 | 9.01 | 25.33 |
bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |
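The Recall@20 column can be read with the help of the sketch below, which shows the set-based definition of Recall@k: the fraction of relevant items that appear among the top k results. This is an assumption for illustration. Question-answering benchmarks such as MKQA are often scored with an answer-containment variant instead, and the exact protocol used for the table is not stated in this card. `recall_at_k` is a hypothetical helper.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Iterable[str], k: int = 20) -> float:
    """Fraction of relevant ids found among the top-k ranked ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


# One of the two relevant passages appears in the top 2.
value = recall_at_k(["p1", "p7", "p3"], ["p1", "p3"], k=2)
print(value)
```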
License
- The code in this repo is released under the Apache-2.0 License.
- Use of the MiniCPM-Embedding-Light model weights must strictly follow the MiniCPM Model License (MiniCPM Model License.md).
- The MiniCPM-Embedding-Light models and weights are completely free for academic research. After filling out a registration questionnaire, the weights are also available for free commercial use.
Evaluation results (self-reported)
Metric | Dataset | Split | Value |
---|---|---|---|
cosine_pearson | MTEB AFQMC (default) | validation | 31.602 |
cosine_spearman | MTEB AFQMC (default) | validation | 32.266 |
euclidean_pearson | MTEB AFQMC (default) | validation | 31.387 |
euclidean_spearman | MTEB AFQMC (default) | validation | 32.266 |
main_score | MTEB AFQMC (default) | validation | 32.266 |
manhattan_pearson | MTEB AFQMC (default) | validation | 31.012 |
manhattan_spearman | MTEB AFQMC (default) | validation | 31.881 |
pearson | MTEB AFQMC (default) | validation | 31.602 |
spearman | MTEB AFQMC (default) | validation | 32.266 |
cosine_pearson | MTEB ATEC (default) | test | 40.900 |