KaLM-Embedding

KaLM-Embedding is a series of embedding models that are adapted from auto-regressive LLMs and trained on superior training data.

KaLM-embedding-multilingual-mini is initialized from Qwen/Qwen2-0.5B and trained on massive weakly-supervised pre-training data and supervised fine-tuning data.

📑 Open-source Plan

Evaluation

| Model Name | Model Size | C-MTEB(35) | MTEB(56) | Avg |
|---|---|---|---|---|
| multilingual-e5-large | 560M | 58.81 | 61.50 | 60.16 |
| bge-m3 (dense) | 560M | 60.80 | 59.84 | 60.32 |
| gte-multilingual-base (dense) | 305M | 62.72 | 61.40 | 62.06 |
| KaLM-embedding-multilingual-mini-v1 | 494M | 62.31 | 61.87 | 62.09 |
| KaLM-embedding-multilingual-mini-instruct-v1 | 494M | 63.57 | 64.74 | 64.16 |
| KaLM-embedding-multilingual-mini-instruct-v1.5 | 494M | 64.13 | 64.94 | 64.53 |

Requirements

Since the model is based on Qwen2, we advise you to install transformers>=4.37.0; with older versions you might encounter the following error:

KeyError: 'qwen2'
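
To catch this before loading the model, the installed version can be compared against the 4.37.0 minimum stated above. The following is a minimal sketch that uses only the standard library; the helper names (`parse_release`, `supports_qwen2`, `check_transformers`) are hypothetical, and the parser is deliberately simplified: it ignores pre-release suffixes, so `4.37.0rc1` is treated like `4.37.0`.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple:
    """Turn '4.37.0' or '4.38.0.dev0' into a 3-part tuple of integers."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'dev0'
        parts.append(int(digits))
    # Pad so that '4.37' compares equal to '4.37.0'.
    parts += [0] * (3 - len(parts))
    return tuple(parts[:3])


def supports_qwen2(installed: str, minimum: str = "4.37.0") -> bool:
    """Return True if the installed version meets the stated minimum."""
    return parse_release(installed) >= parse_release(minimum)


def check_transformers() -> str:
    """Report whether the locally installed transformers is new enough."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if supports_qwen2(installed):
        return f"transformers {installed} is new enough"
    return f"transformers {installed} is too old; upgrade to >=4.37.0"


print(check_transformers())
```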

Usage

Using this model is easy once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512  # inputs longer than 512 tokens are truncated

# normalize_embeddings=True returns unit-length vectors.
embeddings = model.encode(
    sentences,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
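
Because `normalize_embeddings=True` yields unit-length vectors, the cosine similarity between two embeddings reduces to a plain dot product. The sketch below illustrates this with toy 2-dimensional vectors standing in for real embeddings; the helper names `l2_normalize` and `dot` are hypothetical and not part of sentence-transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length, as normalize_embeddings=True does."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit-length."""
    return sum(x * y for x, y in zip(a, b))


# Toy vectors used only for illustration (real embeddings are much longer).
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

similarity = dot(a, b)
print(round(similarity, 4))
```

With real model output, the same computation is usually done on the whole array at once, for example with a matrix product between the query and corpus embeddings.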

We add instructions for asymmetric tasks: retrieval, reranking, classification, and clustering.

To add an instruction to the query (and no instruction to the corpus), you can use the model like this:

from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

# Replace the placeholder with the model name or a local path.
model = SentenceTransformer('{MODEL_NAME_OR_PATH}')   # Do NOT set trust_remote_code
model.max_seq_length = 512

# The prompt is prepended to every input passed to encode().
prompt = "Instruct: Classifying the category of french news. \n Query: "
embeddings = model.encode(
    sentences,
    prompt=prompt,
    normalize_embeddings=True,
    batch_size=256,
    show_progress_bar=True,
)
print(embeddings)
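
For a retrieval-style setup, the asymmetric usage described above might look like the following sketch: queries are encoded with an instruction prompt, the corpus is encoded without one, and the two sets are scored against each other. The function names `build_query_prompt` and `score_queries_against_corpus` are hypothetical, and the prompt template is simply copied from the example above; check it against the model's own documentation before relying on it for other tasks.

```python
def build_query_prompt(task_description: str) -> str:
    """Build a prompt with the same layout as the example prompt string."""
    return f"Instruct: {task_description} \n Query: "


def score_queries_against_corpus(model_name_or_path, task_description, queries, corpus):
    """Encode queries with an instruction and the corpus without one.

    Returns a (num_queries, num_corpus) matrix of cosine similarities.
    """
    # Imported here so the prompt helper can be used without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name_or_path)  # do not set trust_remote_code
    model.max_seq_length = 512

    query_embeddings = model.encode(
        queries,
        prompt=build_query_prompt(task_description),
        normalize_embeddings=True,
    )
    # No prompt for the corpus side.
    corpus_embeddings = model.encode(corpus, normalize_embeddings=True)

    # Unit-length vectors: the dot product is the cosine similarity.
    return query_embeddings @ corpus_embeddings.T


print(repr(build_query_prompt("Classifying the category of french news.")))
```

Keeping the prompt construction in one helper makes it harder for the query-side and corpus-side encodings to drift apart when the task description changes.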

Contact

If you encounter any issues, feel free to contact us by email: [email protected]
