BAAI-Multilingual-Base

BAAI-Multilingual-Base is a text embedding model distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.

Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of an embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
Multi-Linguality: It supports more than 100 working languages.
Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.

Usage

Install:

pip install -U FlagEmbedding

Generate Embedding for text

Dense Embedding

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

embeddings_1 = model.encode(sentences_1, 
                            batch_size=12, 
                            max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
                            )['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.7026 0.439 ]
#  [0.361  0.678 ]]
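As a small follow-up, the similarity matrix can be turned into a per-query ranking of passages. The sketch below does not load the model; it reuses the values printed above (rows correspond to sentences_1, columns to sentences_2), and the variable names `ranking` and `best` are illustrative only.

```python
import numpy as np

# Similarity matrix copied from the printed output above
# (rows: sentences_1, columns: sentences_2).
similarity = np.array([[0.7026, 0.439],
                       [0.361, 0.678]])

# For each query, order passage indices from most to least similar.
ranking = np.argsort(-similarity, axis=1)

# Index of the top-scoring passage for each query.
best = ranking[:, 0]

print(ranking.tolist())  # [[0, 1], [1, 0]]
print(best.tolist())     # [0, 1]
```

Each query is matched to the passage that answers it, which is what the larger diagonal values in the matrix suggest.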

You can also use sentence-transformers and huggingface transformers to generate dense embeddings.

Sparse Embedding (Lexical Weight)

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)

# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095, 
# 'Bas': 0.2299, 'e': 0.153, '?': 0.05536}, {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}]


# compute the scores via lexical matching
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.3666038513183594

print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
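To make the 0.0 result above easier to interpret, here is a minimal sketch of a lexical matching score, assuming it is the sum of weight products over the tokens two texts share. This is an illustration under that assumption, not the library's code: `lexical_matching_score` and `toy_passage` are hypothetical, and the dictionaries are keyed by token strings for readability.

```python
def lexical_matching_score(weights_a, weights_b):
    """Sum the products of weights for tokens that appear in both dicts."""
    return sum(w * weights_b[tok] for tok, w in weights_a.items() if tok in weights_b)

# Token weights copied from the printed output above.
query_1 = {'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576,
           '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734,
           'al': 0.10095, 'Bas': 0.2299, 'e': 0.153, '?': 0.05536}
query_2 = {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495,
           'of': 0.0633, 'BM': 0.2517, '25': 0.3333}

# Hypothetical passage weights, invented for this illustration only.
toy_passage = {'BM': 0.2, '25': 0.3, 'is': 0.1}

# The two queries share no tokens, so their score is zero.
no_overlap = lexical_matching_score(query_1, query_2)

# query_2 and the toy passage share 'BM' and '25'.
overlap = lexical_matching_score(query_2, toy_passage)

print(no_overlap)
print(overlap)
```

Under this assumption, a score of zero simply means that no token appears in both texts, regardless of how large the individual weights are.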

Multi-Vector (ColBERT)

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)

print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7982
# 0.4389
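The multi-vector score follows the ColBERT late-interaction idea: each query token is matched with its most similar passage token, and those maxima are aggregated. The NumPy sketch below assumes the aggregation is a mean over query tokens; `maxsim_score` and the toy vectors are hypothetical and are not taken from the model.

```python
import numpy as np

def maxsim_score(q_vecs, p_vecs):
    """Late-interaction score: mean over query tokens of the best passage match."""
    # Pairwise dot products, shape (num_query_tokens, num_passage_tokens).
    token_scores = q_vecs @ p_vecs.T
    # Best-matching passage token for each query token, then average.
    return float(token_scores.max(axis=1).mean())

# Toy unit-length token vectors, invented for this illustration.
q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
p = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, -1.0]])

score = maxsim_score(q, p)
print(score)  # approximately 0.9: the per-token maxima are 1.0 and 0.8
```

Because every query token contributes its own maximum, the score is not symmetric: swapping the query and the passage generally gives a different value.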

Compute score for text pairs

Given a list of text pairs, you can get the scores computed by each method, along with their weighted combinations.

from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
                       use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation

sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.", 
               "BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]

sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]

# weights_for_different_modes (w) defines a weighted sum:
# w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
print(model.compute_score(sentence_pairs, 
                          max_passage_length=128, # a smaller max length leads to lower latency
                          weights_for_different_modes=[0.4, 0.2, 0.4]))

# {
#   'colbert': [0.7982305884361267, 0.438856840133667, 0.4464578628540039, 0.7897794842720032],
#   'sparse': [0.366455078125, 0.01297760009765625, 0.0, 0.1802978515625],
#   'dense': [0.70263671875, 0.43896484375, 0.361083984375, 0.67822265625],
#   'sparse+dense': [0.5905762314796448, 0.29696908593177795, 0.2407226711511612, 0.5122477412223816],
#   'colbert+sparse+dense': [0.6736379861831665, 0.3537241816520691, 0.3230167627334595, 0.6232604384422302]
# }
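The combined scores can be checked by hand. The printed values are consistent with a weighted sum divided by the sum of the weights actually used, so 'sparse+dense' is normalized by 0.4 + 0.2. That normalization is inferred from the numbers above rather than from the library's source, and `weighted_score` is a hypothetical helper.

```python
def weighted_score(scores, weights):
    """Weighted sum of scores, normalized by the sum of the weights used."""
    total = sum(w * s for w, s in zip(weights, scores))
    return total / sum(weights)

# Scores for the first sentence pair, copied from the printed output above.
dense = 0.70263671875
sparse = 0.366455078125
colbert = 0.7982305884361267

# Same order as weights_for_different_modes: dense, sparse, colbert.
w = [0.4, 0.2, 0.4]

sparse_dense = weighted_score([dense, sparse], w[:2])
all_three = weighted_score([dense, sparse, colbert], w)

print(round(sparse_dense, 4))  # 0.5906
print(round(all_three, 4))     # 0.6736
```

Both results agree with the first entries of 'sparse+dense' and 'colbert+sparse+dense' to within floating-point rounding.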