BAAI-Multilingual-Base
BAAI-Multilingual-Base is a text embedding model distinguished for its versatility in Multi-Functionality, Multi-Linguality, and Multi-Granularity.
- Multi-Functionality: It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval.
- Multi-Linguality: It can support more than 100 working languages.
- Multi-Granularity: It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens.
Usage
Install:
pip install -U FlagEmbedding
Generate Embedding for text
- Dense Embedding
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
embeddings_1 = model.encode(sentences_1,
batch_size=12,
max_length=8192, # If you don't need such a long length, you can set a smaller value to speed up the encoding process.
)['dense_vecs']
embeddings_2 = model.encode(sentences_2)['dense_vecs']
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
# [[0.7026 0.439 ]
# [0.361 0.678 ]]
You also can use sentence-transformers and huggingface transformers to generate dense embeddings.
- Sparse Embedding (Lexical Weight)
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=False)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=False)
# you can see the weight for each token:
print(model.convert_id_to_token(output_1['lexical_weights']))
# [{'What': 0.10126, 'is': 0.1063, 'BA': 0.1858, 'AI': 0.2576, '-': 0.05154, 'Mul': 0.1381, 'ti': 0.1404, 'lingu': 0.2734, 'al': 0.10095,
# 'Bas': 0.2299, 'e': 0.153, '?': 0.05536}, {'De': 0.05002, 'fin': 0.1368, 'ation': 0.04495, 'of': 0.0633, 'BM': 0.2517, '25': 0.3333}]
# compute the scores via lexical mathcing
lexical_scores = model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_2['lexical_weights'][0])
print(lexical_scores)
# 0.3666038513183594
print(model.compute_lexical_matching_score(output_1['lexical_weights'][0], output_1['lexical_weights'][1]))
# 0.0
- Multi-Vector (ColBERT)
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
output_1 = model.encode(sentences_1, return_dense=True, return_sparse=True, return_colbert_vecs=True)
output_2 = model.encode(sentences_2, return_dense=True, return_sparse=True, return_colbert_vecs=True)
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][0]))
print(model.colbert_score(output_1['colbert_vecs'][0], output_2['colbert_vecs'][1]))
# 0.7982
# 0.4389
Compute score for text pairs
Input a list of text pairs, you can get the scores computed by different methods.
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel('hanhainebula/baai-multilingual-base',
use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
sentences_1 = ["What is BAAI-Multilingual-Base?", "Defination of BM25"]
sentences_2 = ["BAAI-Multilingual-Base is an embedding model supporting dense retrieval, lexical matching and multi-vector interaction.",
"BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document"]
sentence_pairs = [[i,j] for i in sentences_1 for j in sentences_2]
print(model.compute_score(sentence_pairs,
max_passage_length=128, # a smaller max length leads to a lower latency
weights_for_different_modes=[0.4, 0.2, 0.4])) # weights_for_different_modes(w) is used to do weighted sum: w[0]*dense_score + w[1]*sparse_score + w[2]*colbert_score
# {
# 'colbert': [0.7982305884361267, 0.438856840133667, 0.4464578628540039, 0.7897794842720032],
# 'sparse': [0.366455078125, 0.01297760009765625, 0.0, 0.1802978515625],
# 'dense': [0.70263671875, 0.43896484375, 0.361083984375, 0.67822265625],
# 'sparse+dense': [0.5905762314796448, 0.29696908593177795, 0.2407226711511612, 0.5122477412223816],
# 'colbert+sparse+dense': [0.6736379861831665, 0.3537241816520691, 0.3230167627334595, 0.6232604384422302]
# }
- Downloads last month
- 0
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
๐
Ask for provider support