MMLW-roberta-base

---
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
language: pl
license: apache-2.0
widget:
- source_sentence: "zapytanie: Jak dożyć 100 lat?"
  sentences:
    - "Trzeba zdrowo się odżywiać i uprawiać sport."
    - "Trzeba pić alkohol, imprezować i jeździć szybkimi autami."
    - "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."

---

<h1 align="center">MMLW-roberta-base</h1>

MMLW (muszę mieć lepszą wiadomość) are neural text encoders for Polish.
This is a distilled model that can be used to generate embeddings applicable to many tasks such as semantic similarity, clustering, information retrieval. The model can also serve as a base for further fine-tuning. 
It transforms texts to 768 dimensional vectors. 
The model was initialized with Polish RoBERTa checkpoint, and then trained with [multilingual knowledge distillation method](https://aclanthology.org/2020.emnlp-main.365/) on a diverse corpus of 60 million Polish-English text pairs. We utilised [English FlagEmbeddings (BGE)](https://huggingface.co/BAAI/bge-base-en) as teacher models for distillation. 

## Usage (Sentence-Transformers)

⚠️ Our embedding models require the use of specific prefixes and suffixes when encoding texts. For this model, each query should be preceded by the prefix **"zapytanie: "** ⚠️

You can use the model like this with [sentence-transformers](https://www.SBERT.net):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

query_prefix = "zapytanie: "
answer_prefix = ""
queries = [query_prefix + "Jak dożyć 100 lat?"]
answers = [
    answer_prefix + "Trzeba zdrowo się odżywiać i uprawiać sport.",
    answer_prefix + "Trzeba pić alkohol, imprezować i jeździć szybkimi autami.",
    answer_prefix + "Gdy trwała kampania politycy zapewniali, że rozprawią się z zakazem niedzielnego handlu."
]
model = SentenceTransformer("sdadas/mmlw-roberta-base")
queries_emb = model.encode(queries, convert_to_tensor=True, show_progress_bar=False)
answers_emb = model.encode(answers, convert_to_tensor=True, show_progress_bar=False)

best_answer = cos_sim(queries_emb, answers_emb).argmax().item()
print(answers[best_answer])
# Trzeba zdrowo się odżywiać i uprawiać sport.
```

## Evaluation Results

The model achieves **NDCG@10** of **53.60** on the Polish Information Retrieval Benchmark. See [PIRB Leaderboard](https://huggingface.co/spaces/sdadas/pirb) for detailed results.

## Acknowledgements
This model was trained with the A100 GPU cluster support delivered by the Gdansk University of Technology within the TASK center initiative.