---
license: apache-2.0
datasets:
- AigizK/bashkir-russian-parallel-corpora
language:
- ba
pipeline_tag: sentence-similarity
---
This is a shallow (3-layer) BERT-like model trained on the Bashkir language to compute sentence embeddings compatible with LaBSE and to perform masked language modelling.
The following code can be used to extract sentence embeddings:
import torch
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')
def embed(texts, max_length=512):
    b = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(model(**b.to(model.device)).pooler_output).cpu().numpy()
embeddings = embed(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.', 'Бөгөн йома.'])
print(embeddings.shape)
# (3, 768)
print(embeddings.dot(embeddings.T).round(2))
# [[1.   0.56 0.18]
#  [0.56 1.   0.32]
#  [0.18 0.32 1.  ]]
For semantically equivalent sentence pairs, the dot products of these embeddings are usually above 0.4. Because the vectors are L2-normalized, these dot products are also their cosine similarities.
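
As a rough illustration of the two points above, the sketch below checks on toy vectors that the dot product of L2-normalized vectors equals the cosine similarity of the originals, and then applies the 0.4 cutoff to the similarity matrix printed in the example. The helper names (`l2_normalize`, `dot`, `cosine`, `similar_pairs`) are hypothetical and not part of the model or of `transformers`. The 0.4 threshold is only the rule of thumb stated above, not a calibrated decision boundary, and the code uses plain Python so it runs without downloading the model.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity computed from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def similar_pairs(sim, threshold=0.4):
    """Return index pairs (i, j) with i < j whose similarity exceeds the threshold.

    The default of 0.4 is an assumption taken from the rule of thumb above.
    """
    n = len(sim)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] > threshold]

# Toy vectors (not real embeddings): after L2 normalization,
# the dot product matches the cosine similarity of the originals.
a, b = [3.0, 4.0], [4.0, 3.0]
na, nb = l2_normalize(a), l2_normalize(b)
assert math.isclose(dot(na, nb), cosine(a, b))

# Similarity matrix copied from the printed output of the embedding example.
sim = [
    [1.0, 0.56, 0.18],
    [0.56, 1.0, 0.32],
    [0.18, 0.32, 1.0],
]
print(similar_pairs(sim))  # [(0, 1)]
```

With real embeddings, the same `similar_pairs` call could be applied to `embeddings.dot(embeddings.T)`. A suitable threshold depends on the task, so it is worth validating on labelled pairs before relying on it.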