Stella-PL-retrieval

This is a text encoder based on stella_en_1.5B_v5 and further fine-tuned for Polish information retrieval tasks.

  • In the first step, we adapted the model for Polish with multilingual knowledge distillation method using a diverse corpus of 20 million Polish-English text pairs.
  • The second step involved fine-tuning the model with contrastrive loss using a dataset consisting of 1.4 million queries. Positive and negative passages for each query have been selected with the help of BAAI/bge-reranker-v2.5-gemma2-lightweight reranker. The model was trained for three epochs with a batch size of 1024 queries.

The encoder transforms texts to 1024 dimensional vectors. The model is optimized specifically for Polish information retrieval tasks. If you need a more versatile encoder, suitable for a wider range of tasks such as semantic similarity or clustering, you should probably use the distilled version from the first step: sdadas/stella-pl.

Usage (Sentence-Transformers)

The model utilizes the same prompts as the original stella_en_1.5B_v5.

For retrieval, queries should be prefixed with "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: ".

For symmetric tasks such as semantic similarity, both texts should be prefixed with "Instruct: Retrieve semantically similar text.\nQuery: ".

Please note that the model uses a custom implementation, so you should add trust_remote_code=True argument when loading it. It is also recommended to use Flash Attention 2, which can be enabled with attn_implementation argument. You can use the model like this with sentence-transformers:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "sdadas/stella-pl-retrieval",
    trust_remote_code=True,
    device="cuda",
    model_kwargs={"attn_implementation": "flash_attention_2", "trust_remote_code": True}
)
model.bfloat16()

# Retrieval example
query_prefix = "Instruct: Given a web search query, retrieve relevant passages that answer the query.\nQuery: "
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    "Trzeba zdrowo się odżywiać i uprawiać sport.",
    "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)
best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])

# Semantic similarity example
sim_prefix = "Instruct: Retrieve semantically similar text.\nQuery: "
sentences = [
    sim_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    sim_prefix + "Warto jest prowadzić zdrowy tryb życia, uwzględniający aktywność fizyczną i dietę.",
    sim_prefix + "One should eat healthy and engage in sports.",
    sim_prefix + "Zakupy potwierdzasz PINem, który bezpiecznie ustalisz podczas aktywacji."
]
emb = model.encode(sentences, convert_to_tensor=True, show_progress_bar=False)
print(cos_sim(emb, emb))

Evaluation Results

The model achieves NDCG@10 of 62.32 on the Polish Information Retrieval Benchmark. See PIRB Leaderboard for detailed results.

Citation

@article{dadas2024pirb,
  title={{PIRB}: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods}, 
  author={Sławomir Dadas and Michał Perełkiewicz and Rafał Poświata},
  year={2024},
  eprint={2402.13350},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
Downloads last month
253
Safetensors
Model size
1.54B params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Collection including sdadas/stella-pl-retrieval