MiniCPM-Embedding-Light

MiniCPM-Embedding-Light is a bilingual and cross-lingual (Chinese-English) text embedding model developed by ModelBest Inc., the Natural Language Processing Lab at Tsinghua University (THUNLP), and the Information Retrieval Group at Northeastern University (NEUIR). Its features include:

  • Exceptional Chinese and English retrieval capabilities.
  • Outstanding cross-lingual retrieval capabilities between Chinese and English.
  • Long-text support (up to 8192 tokens).
  • Dense vectors and token-level sparse vectors.
  • Variable dense vector dimensions (Matryoshka representation [2]).
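
As a rough illustration of the variable-dimension feature, Matryoshka-style embeddings [2] are usually shortened by keeping the leading components and re-normalizing. The sketch below assumes that convention applies to this model. The helper name `truncate_embeddings`, the target size of 256, and the random stand-in vectors are illustrative and are not part of the model's API.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and re-normalize to unit length.

    Hypothetical helper: assumes the usual Matryoshka convention of using
    leading dimensions, which this README does not spell out.
    """
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError("dim must be between 1 and the full embedding size")
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)


# Random unit vectors stand in for real 1024-dimensional model outputs.
rng = np.random.default_rng(0)
full = rng.standard_normal((2, 1024)).astype(np.float32)
full /= np.linalg.norm(full, axis=1, keepdims=True)

small = truncate_embeddings(full, 256)
print(small.shape)  # (2, 256)
```

Queries and passages should be truncated to the same size before scoring, since the inner product is only meaningful between vectors of equal dimension.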

MiniCPM-Embedding-Light incorporates bidirectional attention and Weighted Mean Pooling [1] in its architecture. The model underwent multi-stage training using approximately 260 million training examples, including open-source, synthetic, and proprietary data.
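
To make the pooling step concrete, here is a minimal sketch of position-weighted mean pooling as described in SGPT [1], where later tokens receive linearly larger weights and padding is masked out. Whether this model uses exactly these weights is an assumption; the function name `weighted_mean_pool` and the toy inputs are illustrative only.

```python
import numpy as np


def weighted_mean_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool token states of shape (seq_len, dim) into a single vector of shape (dim,).

    Token i (1-based) gets weight i, padded positions get weight 0, and the
    weights are normalized to sum to 1 (the SGPT scheme; assumed here).
    """
    positions = np.arange(1, hidden_states.shape[0] + 1, dtype=np.float64)
    weights = positions * attention_mask
    weights = weights / weights.sum()
    return weights @ hidden_states


# Three token positions, the last one is padding.
hidden = np.array([[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
mask = np.array([1.0, 1.0, 0.0])
pooled = weighted_mean_pool(hidden, mask)
print(pooled)  # weights are [1/3, 2/3, 0], so the result is about [0.333, 0.667]
```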

We also invite you to explore the UltraRAG series:

[1] Muennighoff, N. (2022). SGPT: GPT sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904.
[2] Kusupati, A., et al. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249.

Model Information

  • Model Size: 440M

  • Embedding Dimension: 1024

  • Max Input Tokens: 8192

Usage

Input Format

MiniCPM-Embedding-Light supports query-side instructions in the following format:

Instruction: {{ instruction }} Query: {{ query }}

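For example (the first example is in Chinese and reads "Instruction: Retrieve relevant answers to this medical question. Query: What causes throat cancer?"):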

Instruction: 为这个医学问题检索相关回答。Query: 咽喉癌的成因是什么?
Instruction: Given a claim about climate change, retrieve documents that support or refute the claim. Query: However the warming trend is slower than most climate models have forecast.

MiniCPM-Embedding-Light also works without an instruction. In that case, use the following format:

Query: {{ query }}
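
The two input formats above can be produced with a small helper. This is only a sketch: `format_query` is a hypothetical name and is not part of the model's API.

```python
from typing import Optional


def format_query(query: str, instruction: Optional[str] = None) -> str:
    """Build the query string in the instruction or instruction-free format."""
    if instruction:
        return f"Instruction: {instruction} Query: {query}"
    return f"Query: {query}"


print(format_query("What causes throat cancer?"))
# Query: What causes throat cancer?
print(format_query(
    "What causes throat cancer?",
    instruction="Retrieve relevant answers to this medical question.",
))
# Instruction: Retrieve relevant answers to this medical question. Query: What causes throat cancer?
```

Only queries are formatted this way; passages are encoded without a prefix in the demo scripts.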

Requirements

transformers==4.37.2

Demo

Hugging Face Transformers

from transformers import AutoModel
import torch

model_name = "OpenBMB/MiniCPM-Embedding-Light"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda")

# Optionally use flash_attention_2 for faster inference (requires the flash-attn package):
# model = AutoModel.from_pretrained(model_name, trust_remote_code=True, attn_implementation="flash_attention_2", torch_dtype=torch.float16).to("cuda")

model.eval()

queries = ["MiniCPM-o 2.6 A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone"]
passages = ["MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming."]

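# Encode queries and passages; each call returns dense vectors and token-level sparse vectors.
embeddings_query_dense, embeddings_query_sparse = model.encode_query(queries, return_sparse_vectors=True)
embeddings_doc_dense, embeddings_doc_sparse = model.encode_corpus(passages, return_sparse_vectors=True)

# Dense relevance: inner product between query and passage vectors.
dense_scores = (embeddings_query_dense @ embeddings_doc_dense.T)
print(dense_scores.tolist())  # [[0.6512398719787598]]
# Sparse relevance, computed from the token-level sparse vectors.
print(model.compute_sparse_score_dicts(embeddings_query_sparse, embeddings_doc_sparse))  # [[0.27202296]]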

dense_scores, sparse_scores, mixed_scores = model.compute_score(queries, passages)
print(dense_scores) # [[0.65123993]]
print(sparse_scores) # [[0.27202296]]
print(mixed_scores) # [[0.73284686]]
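
For intuition about how the three scores relate, the sketch below assumes that each sparse vector is a mapping from token to weight and that the sparse score is the sum of weight products over shared tokens, a common lexical-matching scheme. The default `sparse_weight=0.3` is inferred from the printed numbers above (0.6512 + 0.3 × 0.2720 ≈ 0.7328) and is not a documented constant. All function names and the toy weights are hypothetical.

```python
from typing import Dict


def sparse_score(query_weights: Dict[str, float], doc_weights: Dict[str, float]) -> float:
    """Sum of weight products over tokens that occur in both query and document."""
    return sum(w * doc_weights[tok] for tok, w in query_weights.items() if tok in doc_weights)


def mixed_score(dense: float, sparse: float, sparse_weight: float = 0.3) -> float:
    """Combine dense and sparse scores; the 0.3 weight is inferred, not documented."""
    return dense + sparse_weight * sparse


# Toy token weights, not real model output.
q = {"minicpm": 0.5, "phone": 0.2}
d = {"minicpm": 0.4, "model": 0.3}
print(sparse_score(q, d))  # 0.2 (only "minicpm" is shared: 0.5 * 0.4)

# Reproduces the mixed score printed above from the dense and sparse scores.
print(mixed_score(0.65123993, 0.27202296))  # approximately 0.73285
```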

Sentence Transformers

import torch
from sentence_transformers import SentenceTransformer


model_name = "OpenBMB/MiniCPM-Embedding-Light"
model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"torch_dtype": torch.float16})

# you can use flash_attention_2 for faster inference
# model = SentenceTransformer(model_name, trust_remote_code=True, model_kwargs={"attn_implementation": "flash_attention_2", "torch_dtype": torch.float16})

queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"

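# Instruction-free mode: every query is prefixed with "Query: ".
INSTRUCTION = "Query: "

# Only queries receive the prompt; passages are encoded as-is.
embeddings_query = model.encode(queries, prompt=INSTRUCTION)
embeddings_doc = model.encode(passages)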

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.40356746315956116, 0.36183440685272217]]
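
A score matrix like the one above can be turned into a ranked list per query. The following is a minimal sketch in plain Python that reuses the printed scores; `rank_passages` is a hypothetical helper, not a Sentence Transformers function.

```python
from typing import List, Optional, Tuple


def rank_passages(
    passages: List[str], scores: List[float], top_k: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Sort passages by descending score and optionally keep the top_k best."""
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


passages = ["beijing", "shanghai"]
scores = [[0.40356746315956116, 0.36183440685272217]]  # one row per query

best = rank_passages(passages, scores[0], top_k=1)
print(best[0][0])  # beijing
```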

Infinity

import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
import numpy as np

array = AsyncEngineArray.from_args([
  EngineArgs(model_name_or_path = "OpenBMB/MiniCPM-Embedding-Light", engine="torch", dtype="float16", bettertransformer=False, pooling_method="mean", trust_remote_code=True),
])
queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"

INSTRUCTION = "Query:"
queries = [f"{INSTRUCTION} {query}" for query in queries]


async def embed_text(engine: AsyncEmbeddingEngine, sentences):
    # Start the engine for the duration of the call and return only the embeddings.
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    return embeddings

queries_embedding = asyncio.run(embed_text(array[0], queries))
passages_embedding = asyncio.run(embed_text(array[0], passages))

scores = (np.array(queries_embedding) @ np.array(passages_embedding).T)
print(scores.tolist())  # [[0.40356746315956116, 0.36183443665504456]]

FlagEmbedding

from FlagEmbedding import FlagModel


model = FlagModel(
    "OpenBMB/MiniCPM-Embedding-Light",
    query_instruction_for_retrieval="Query: ",
    pooling_method="mean",
    trust_remote_code=True,
    normalize_embeddings=True,
    use_fp16=True,
)
# You can hack the __init__() method of the FlagEmbedding BaseEmbedder class to use flash_attention_2 for faster inference
#  self.model = AutoModel.from_pretrained(
#             model_name_or_path,
#             trust_remote_code=trust_remote_code,
#             cache_dir=cache_dir,
#             # torch_dtype=torch.float16, # we need to add this line to use fp16
#             # attn_implementation="flash_attention_2", # we need to add this line to use flash_attention_2
#         )

queries = ["中国的首都是哪里?"] # "What is the capital of China?"
passages = ["beijing", "shanghai"] # "北京", "上海"


embeddings_query = model.encode_queries(queries)
embeddings_doc = model.encode_corpus(passages)

scores = (embeddings_query @ embeddings_doc.T)
print(scores.tolist())  # [[0.40356746315956116, 0.36183440685272217]]

Evaluation Results

Chinese and English Retrieval Results

| Model | C-MTEB/Retrieval (NDCG@10) | BEIR (NDCG@10) |
|---|---|---|
| bge-large-zh-v1.5 | 70.46 | - |
| gte-large-zh | 72.49 | - |
| Conan-embedding-v1 | 76.67 | - |
| bge-large-en-v1.5 | - | 54.29 |
| modernbert-embed-large | - | 54.36 |
| snowflake-arctic-embed-l | - | 55.98 |
| gte-en-large-v1.5 | - | 57.91 |
| me5-large | 63.66 | 51.43 |
| bge-m3(Dense) | 65.43 | 48.82 |
| gte-multilingual-base(Dense) | 71.95 | 51.08 |
| jina-embeddings-v3 | 68.60 | 53.88 |
| gte-Qwen2-1.5B-instruct | 71.86 | 58.29 |
| MiniCPM-Embedding | 76.76 | 58.56 |
| MiniCPM-Embedding-Light(Dense) | 72.71 | 55.27 |
| MiniCPM-Embedding-Light(Dense+Sparse) | 73.13 | 56.31 |
| MiniCPM-Embedding-Light(Dense+Sparse)+MiniCPM-Reranker-Light | 76.34 | 61.49 |
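
For readers unfamiliar with the metric, the sketch below shows one common way to compute NDCG@10 for a single query, using linear gain (some implementations use 2^rel - 1 instead). The scores in the table are averages over all queries, scaled by 100. The exact evaluation code behind the table is not given here, so treat `ndcg_at_k` and the toy data as illustrative.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)) over ranks i = 1..k."""

    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal_gains)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy example: the only relevant document is retrieved at rank 2.
ranked = ["d3", "d1", "d2"]
qrels = {"d1": 1}
print(round(100 * ndcg_at_k(ranked, qrels), 2))  # 63.09
```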

Chinese-English Cross-lingual Retrieval Results

| Model | MKQA En-Zh_CN (Recall@20) | NeuCLIR22 (NDCG@10) | NeuCLIR23 (NDCG@10) |
|---|---|---|---|
| me5-large | 44.3 | 9.01 | 25.33 |
| bge-m3(Dense) | 66.4 | 30.49 | 41.09 |
| gte-multilingual-base(Dense) | 68.2 | 39.46 | 45.86 |
| MiniCPM-Embedding | 72.95 | 52.65 | 49.95 |
| MiniCPM-Embedding-Light(Dense) | 68.29 | 41.17 | 45.83 |
| MiniCPM-Embedding-Light(Dense)+MiniCPM-Reranker-Light | 71.86 | 54.32 | 56.50 |

License

  • The code in this repo is released under the Apache-2.0 License.
  • Use of the MiniCPM-Embedding-Light model weights must strictly follow MiniCPM Model License.md.
  • The MiniCPM-Embedding-Light models and weights are completely free for academic research. After you fill out a registration questionnaire, the weights are also available for free commercial use.