---
license: apache-2.0
datasets:
  - AigizK/bashkir-russian-parallel-corpora
language:
  - ba
pipeline_tag: sentence-similarity
---

This is a shallow (3-layer) BERT-like model, trained on the Bashkir language to compute sentence embeddings compatible with LaBSE and to do masked language modelling.

The following code can be used to extract sentence embeddings:

```python
import torch
from transformers import AutoModel, AutoTokenizer
model = AutoModel.from_pretrained('slone/LaBSE-shallow-distilled-bak')
tokenizer = AutoTokenizer.from_pretrained('slone/LaBSE-shallow-distilled-bak')

def embed(texts, max_length=512):
    b = tokenizer(texts, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
    with torch.inference_mode():
        return torch.nn.functional.normalize(model(**b.to(model.device)).pooler_output).cpu().numpy()

embeddings = embed(['Сәләм, ғаләм!', 'Хәйерле көн, тыныслыҡ.', 'Бөгөн йома.'])
print(embeddings.shape)
# (3, 768)
print(embeddings.dot(embeddings.T).round(2))
# [[1.   0.56 0.18]
#  [0.56 1.   0.32]
#  [0.18 0.32 1.  ]]
```

For semantically equivalent sentence pairs, the dot products of these embeddings (which are also their cosine similarities, because the vectors are L2-normed) are usually above 0.4.
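
As a rough illustration of how that 0.4 value might be applied, the sketch below lists all sentence pairs whose dot product reaches a threshold. The helper name `similar_pairs` is hypothetical and not part of this model or of `transformers`. The toy 2-dimensional unit vectors stand in for real embeddings so that the snippet runs without downloading the model. The threshold is a heuristic and may need tuning for your data.

```python
import numpy as np

def similar_pairs(embeddings, threshold=0.4):
    """Return (i, j, score) for every pair i < j whose dot product is >= threshold.

    Assumes the rows of `embeddings` are L2-normalized, so that the dot
    product equals the cosine similarity.
    """
    sims = embeddings @ embeddings.T
    # Indices of the upper triangle, excluding the diagonal (self-similarity).
    rows, cols = np.triu_indices(len(embeddings), k=1)
    return [
        (int(i), int(j), float(sims[i, j]))
        for i, j in zip(rows, cols)
        if sims[i, j] >= threshold
    ]

# Toy unit vectors used only for illustration; these are NOT real model outputs.
toy = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
pairs = similar_pairs(toy)
print([(i, j) for i, j, _ in pairs])
```

With real data, a 2-D array of L2-normalized sentence embeddings would be passed in place of `toy`.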