Stella-PL

This is a bilingual Polish-English text encoder based on stella_en_1.5B_v5. We adapted the model to Polish using multilingual knowledge distillation on a diverse corpus of 20 million Polish-English text pairs. The encoder maps texts to 1024-dimensional vectors. For English texts, the resulting embeddings should be similar to those of the original Stella model. Embeddings can be compared within a single language (Polish or English) as well as across languages.

Usage (Sentence-Transformers)

The model utilizes the same prompts as the original stella_en_1.5B_v5.

For retrieval, queries should be prefixed with "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: ".

For symmetric tasks such as semantic similarity, both texts should be prefixed with "Instruct: Retrieve semantically similar text.\nQuery: ".
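
As a minimal sketch of the prefixing rules above, the helper below prepends a prompt to a list of texts. The two prompt strings are copied from this section; the names `RETRIEVAL_PROMPT`, `SIMILARITY_PROMPT`, and `with_prompt` are illustrative and not part of the model or of sentence-transformers.

```python
# Prompt strings copied from the text above; all names here are illustrative.
RETRIEVAL_PROMPT = (
    "Instruct: Given a web search query, retrieve relevant passages "
    "that answer the query.\nQuery: "
)
SIMILARITY_PROMPT = "Instruct: Retrieve semantically similar text.\nQuery: "


def with_prompt(texts, prompt):
    """Return a new list with `prompt` prepended to every text."""
    return [prompt + text for text in texts]


# Retrieval: only the queries get the prefix.
queries = with_prompt(["Jak dożyć 100 lat?"], RETRIEVAL_PROMPT)

# Symmetric tasks: both sides get the same prefix.
pair = with_prompt(
    ["Trzeba zdrowo się odżywiać.", "One should eat healthy."],
    SIMILARITY_PROMPT,
)
```

The resulting lists can be passed to `model.encode` directly, as in the full example in this section.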

Note that the model uses a custom implementation, so you need to pass the trust_remote_code=True argument when loading it. We also recommend using Flash Attention 2, which can be enabled with the attn_implementation argument. The model can be used with sentence-transformers as follows:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

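# trust_remote_code is required because the model ships a custom implementation.
model = SentenceTransformer(
    "sdadas/stella-pl",
    trust_remote_code=True,
    device="cuda",
    # Flash Attention 2 is optional but recommended; it needs the flash-attn package.
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
# Cast the weights to bfloat16 (Flash Attention 2 requires half precision).
model.bfloat16()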

# Retrieval example
query_prefix = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
# With a single query, the flattened argmax is the index of the best answer.
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic similarity example
sim_prefix = "Instruct: Retrieve semantically similar text.\nQuery: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
    sim_prefix + "One should eat healthy and engage in sports.",
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))
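
The retrieval example above keeps only the single best answer. If a ranked list is more useful, the scores can be sorted instead. The sketch below works on plain Python lists; `top_k` is a hypothetical helper, and the scores are made-up placeholders standing in for something like `cos_sim(queries_emb, answers_emb)[0].tolist()`.

```python
def top_k(scores, items, k=2):
    """Return the k (item, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder values, not real model output.
example_answers = ["answer A", "answer B", "answer C"]
example_scores = [0.71, 0.38, 0.05]

best = top_k(example_scores, example_answers, k=2)
for answer, score in best:
    print(f"{score:.2f}  {answer}")
```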

Evaluation Results

The model achieves an NDCG@10 of 60.52 on the Polish Information Retrieval Benchmark (PIRB). See the PIRB Leaderboard for detailed results.
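
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It is not the PIRB evaluation code. It assumes linear gain (the relevance label is used directly) and that `relevances` lists the labels of every judged document in the order the model ranked them. Benchmark scores such as the one above are usually such per-query values averaged and reported on a 0-100 scale.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document, ranked second out of three.
score = ndcg_at_k([0, 1, 0])
print(round(score, 4))
```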

Citation

@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods}, 
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}