|
--- |
|
library_name: transformers |
|
license: mit |
|
datasets: |
|
- nlpai-lab/ko-triplet-v1.0 |
|
language: |
|
- ko |
|
- en |
|
base_model: |
|
- intfloat/multilingual-e5-large |
|
pipeline_tag: sentence-similarity |
|
--- |
|
|
|
# KoE5 |
|
|
|
Introducing KoE5, a model with advanced retrieval abilities. |
|
It has shown remarkable performance in Korean text retrieval, specifically outperforming most multilingual embedding models.
|
To our knowledge, it is one of the best publicly available Korean retrieval models.
|
|
|
For details, visit the [KoE5 repository](https://github.com/nlpai-lab/KoE5).
|
|
|
### Model Description |
|
|
|
This is the model card of a 🤗 transformers model that has been pushed to the Hub.
|
|
|
- **Developed by:** [NLP&AI Lab](http://nlp.korea.ac.kr/) |
|
- **Language(s) (NLP):** Korean, English |
|
- **License:** MIT |
|
- **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) |
|
- **Finetuned dataset:** [ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0)
|
|
|
## Example code |
|
### Install Dependencies |
|
First install the Sentence Transformers library: |
|
|
|
```bash |
|
pip install -U sentence-transformers |
|
``` |
|
### Python code |
|
Then you can load this model and run inference. |
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
|
|
# Download from the 🤗 Hub
|
model = SentenceTransformer("nlpai-lab/KoE5") |
|
|
|
# Run inference |
|
sentences = [ |
|
    'query: 헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',

    'passage: 4. 시사점과 개선방향 앞서 살펴본 바와 같이 우리 헌법과 ｢법원조직 법｣은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다. 더욱이 합의체로서의 대법원 원리를 채택하고 있는 것 역시 그 구성의 다양성을 요청하는 것으로 해석된다. 이와 같은 관점에서 볼 때 현직 법원장급 고위법관을 중심으로 대법원을 구성하는 관행은 개선할 필요가 있는 것으로 보인다.',

    'passage: ① 연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음 ○ 5인의 다수 재판관은 소송관계인의 인격권 보호, 공정한 절차의 보장과 방해받지 않는 법과 진실 발견 등을 근거로 하여 텔레비전 촬영에 대한 절대적인 금지를 헌법에 합치하는 것으로 보았음 ○ 그러나 나머지 3인의 재판관은 행정법원의 소송절차는 특별한 인격권 보호의 이익도 없으며, 텔레비전 공개주의로 인해 법과 진실 발견의 과정이 언제나 위태롭게 되는 것은 아니라면서 반대의견을 제시함 ○ 왜냐하면 행정법원의 소송절차에서는 소송당사자가 개인적으로 직접 심리에 참석하기보다는 변호사가 참석하는 경우가 많으며, 심리대상은 사실문제가 아닌 법률문제가 대부분이기 때문이라는 것임 ② 한편, 연방헌법재판소는 「연방헌법재판소법」(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a조에 따라 제한적이나마 재판에 대한 방송을 허용하고 있음 ○ 「연방헌법재판소법」 제17조에서 「법원조직법」 제14절 내지 제16절의 규정을 준용하도록 하고 있지만, 녹음이나 촬영을 통한 재판공개와 관련하여서는 「법원조직법」과 다른 내용을 규정하고 있음',
|
] |
|
embeddings = model.encode(sentences) |
|
print(embeddings.shape) |
|
# [3, 1024] |
|
|
|
# Get the similarity scores for the embeddings |
|
similarities = model.similarity(embeddings, embeddings) |
|
print(similarities) |
|
# tensor([[1.0000, 0.6721, 0.3897], |
|
# [0.6721, 1.0000, 0.3740], |
|
# [0.3897, 0.3740, 1.0000]]) |
|
``` |
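
To use the embeddings for retrieval, rank the passages by their similarity to the query. A minimal continuation of the example above (the variable names come from that snippet):

```python
# Continuing from the example above: rank both passages against the query
query_embedding = embeddings[0:1]    # the "query: ..." sentence
passage_embeddings = embeddings[1:]  # the two "passage: ..." sentences

scores = model.similarity(query_embedding, passage_embeddings)  # shape: [1, 2]
best_idx = int(scores.argmax())
print(f"Best passage (score={scores[0, best_idx].item():.4f}):")
print(sentences[1 + best_idx])
```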
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
- [ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0)
|
- Korean query-document-hard_negative triplets (open data)
|
- About 700,000+ examples were used in total
|
|
|
### Training Procedure |
|
|
|
- **loss:** [CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss) from Sentence Transformers
|
- **batch size:** 512 |
|
- **learning rate:** 1e-05 |
|
- **epochs:** 1 |
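
As a rough illustration only, a fine-tuning run with the hyperparameters above could be set up with the Sentence Transformers trainer along these lines. This is a hedged sketch, not the authors' training script; the dataset column names/ordering and the mini-batch size are assumptions:

```python
from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Start from the base model that KoE5 was fine-tuned from
model = SentenceTransformer("intfloat/multilingual-e5-large")

# Assumption: the triplet dataset exposes columns in (anchor, positive, negative) order
# and the texts already carry the "query: " / "passage: " prefixes; adjust to the real schema.
train_dataset = load_dataset("nlpai-lab/ko-triplet-v1.0", split="train")

# The cached loss keeps the large (512) batch's in-batch negatives while bounding
# GPU memory through mini_batch_size (an assumed value here).
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="koe5-finetuned",  # hypothetical output directory
    num_train_epochs=1,
    per_device_train_batch_size=512,
    learning_rate=1e-5,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
```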
|
|
|
## Evaluation |
|
### Metrics |
|
- NDCG@1, F1@1, NDCG@3, F1@3 |
|
### Benchmark Datasets |
|
- Ko-strategyQA |
|
- AutoRAG-benchmark |
|
- PublicHealthQA |
|
|
|
## Results |
|
|
|
![image/png](https://cdn-uploads.huggingface.co/production/uploads/65a4c4ed2548c41ad9b1421c/8toWmSrqH-aLKq1rSiqnv.png) |
|
|
|
## FAQ |
|
|
|
**1. Do I need to add the prefix "query: " and "passage: " to input texts?** |
|
|
|
Yes, this is how the model was trained; otherwise you will see a performance degradation. A short snippet illustrating the prefixes follows the rules of thumb below.
|
|
|
Here are some rules of thumb: |
|
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval. |
|
|
|
- Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval. |
|
|
|
- Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering. |
|
|
|
## Citation |
|
|
|
If you find our paper or models helpful, please consider citing them as follows:
|
```text |
|
@misc{KoE5, |
|
author = {NLP & AI Lab and Human-Inspired AI research}, |
|
title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance}, |
|
year = {2024}, |
|
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee}, |
|
journal = {GitHub repository}, |
|
howpublished = {\url{https://github.com/nlpai-lab/KoE5}}, |
|
} |
|
``` |
|
```text
|
@article{wang2024multilingual, |
|
title={Multilingual E5 Text Embeddings: A Technical Report}, |
|
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu}, |
|
journal={arXiv preprint arXiv:2402.05672}, |
|
year={2024} |
|
} |
|
``` |
|
|
|
## Limitations |
|
|
|
Long texts will be truncated to at most 512 tokens. |
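
You can check (or lower) this limit programmatically through the Sentence Transformers wrapper; a small illustrative check, assuming the 512-token limit stated above:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KoE5")
print(model.max_seq_length)  # expected: 512; longer inputs are truncated to this length
```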