File size: 5,828 Bytes
3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd b940fa1 3a67bbd b940fa1 3a67bbd b940fa1 3a67bbd b940fa1 3a67bbd b940fa1 3a67bbd b940fa1 7656f9a 3a67bbd 7656f9a b940fa1 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd b940fa1 3a1effd b940fa1 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a b940fa1 7656f9a 3a67bbd 7656f9a 3a67bbd 7656f9a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 |
---
library_name: transformers
license: mit
datasets:
- nlpai-lab/ko-triplet-v1.0
language:
- ko
- en
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: sentence-similarity
---
# KoE5
Introducing KoE5, a model with advanced retrieval abilities.
It has shown remarkable performance in Korean text retrieval, speficially overwhelming most multilingual embedding models.
To our knowledge, It is one of the best publicly opened Korean retrieval models.
For details, visit the [KoE5 repository](https://github.com/nlpai-lab/KoE5)
### Model Description
This is the model card of a π€ transformers model that has been pushed on the Hub.
- **Developed by:** [NLP&AI Lab](http://nlp.korea.ac.kr/)
- **Language(s) (NLP):** Korean, English
- **License:** MIT
- **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
- **Finetuned dataset:** [ko-triplet-v1.0](nlpai-lab/ko-triplet-v1.0)
## Example code
### Install Dependencies
First install the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
### Python code
Then you can load this model and run inference.
```python
from sentence_transformers import SentenceTransformer
# Download from the π€ Hub
model = SentenceTransformer("nlpai-lab/KoE5")
# Run inference
sentences = [
'query: νλ²κ³Ό λ²μμ‘°μ§λ²μ μ΄λ€ λ°©μμ ν΅ν΄ κΈ°λ³ΈκΆ λ³΄μ₯ λ±μ λ€μν λ²μ λͺ¨μμ κ°λ₯νκ² νμ΄',
'passage: 4. μμ¬μ κ³Ό κ°μ λ°©ν₯ μμ μ΄ν΄λ³Έ λ°μ κ°μ΄ μ°λ¦¬ νλ²κ³Ό ο½’λ²μμ‘°μ§ λ²ο½£μ λλ²μ ꡬμ±μ λ€μννμ¬ κΈ°λ³ΈκΆ λ³΄μ₯κ³Ό λ―Όμ£Όμ£Όμ ν립μ μμ΄ λ€κ°μ μΈ λ²μ λͺ¨μμ κ°λ₯νκ² νλ κ²μ κ·Όλ³Έ κ·λ²μΌλ‘ νκ³ μλ€. λμ±μ΄ ν©μ체λ‘μμ λλ²μ μ리λ₯Ό μ±ννκ³ μλ κ² μμ κ·Έ ꡬμ±μ λ€μμ±μ μμ²νλ κ²μΌλ‘ ν΄μλλ€. μ΄μ κ°μ κ΄μ μμ λ³Ό λ νμ§ λ²μμ₯κΈ κ³ μλ²κ΄μ μ€μ¬μΌλ‘ λλ²μμ ꡬμ±νλ κ΄νμ κ°μ ν νμκ° μλ κ²μΌλ‘ 보μΈλ€.',
'passage: β‘ μ°λ°©νλ²μ¬νμλ 2001λ
1μ 24μΌ 5:3μ λ€μ견ν΄λ‘ γλ²μμ‘°μ§λ²γ μ 169μ‘° μ 2λ¬Έμ΄ νλ²μ ν©μΉλλ€λ νκ²°μ λ΄λ Έμ β 5μΈμ λ€μ μ¬νκ΄μ μμ‘κ΄κ³μΈμ μΈκ²©κΆ 보νΈ, 곡μ ν μ μ°¨μ 보μ₯κ³Ό λ°©ν΄λ°μ§ μλ λ²κ³Ό μ§μ€ λ°κ²¬ λ±μ κ·Όκ±°λ‘ νμ¬ ν
λ λΉμ 촬μμ λν μ λμ μΈ κΈμ§λ₯Ό νλ²μ ν©μΉνλ κ²μΌλ‘ 보μμ β κ·Έλ¬λ λλ¨Έμ§ 3μΈμ μ¬νκ΄μ νμ λ²μμ μμ‘μ μ°¨λ νΉλ³ν μΈκ²©κΆ 보νΈμ μ΄μ΅λ μμΌλ©°, ν
λ λΉμ 곡κ°μ£Όμλ‘ μΈν΄ λ²κ³Ό μ§μ€ λ°κ²¬μ κ³Όμ μ΄ μΈμ λ μνλ‘κ² λλ κ²μ μλλΌλ©΄μ λ°λμ견μ μ μν¨ β μλνλ©΄ νμ λ²μμ μμ‘μ μ°¨μμλ μμ‘λΉμ¬μκ° κ°μΈμ μΌλ‘ μ§μ μ¬λ¦¬μ μ°Έμν기보λ€λ λ³νΈμ¬κ° μ°Έμνλ κ²½μ°κ° λ§μΌλ©°, μ¬λ¦¬λμλ μ¬μ€λ¬Έμ κ° μλ λ²λ₯ λ¬Έμ κ° λλΆλΆμ΄κΈ° λλ¬Έμ΄λΌλ κ²μ β‘ ννΈ, μ°λ°©νλ²μ¬νμλ γμ°λ°©νλ²μ¬νμλ²γ(Bundesverfassungsgerichtsgesetz: BVerfGG) μ 17aμ‘°μ λ°λΌ μ νμ μ΄λλ§ μ¬νμ λν λ°©μ‘μ νμ©νκ³ μμ β γμ°λ°©νλ²μ¬νμλ²γ μ 17μ‘°μμ γλ²μμ‘°μ§λ²γ μ 14μ λ΄μ§ μ 16μ μ κ·μ μ μ€μ©νλλ‘ νκ³ μμ§λ§, λ
Ήμμ΄λ 촬μμ ν΅ν μ¬ν곡κ°μ κ΄λ ¨νμ¬μλ γλ²μμ‘°μ§λ²γκ³Ό λ€λ₯Έ λ΄μ©μ κ·μ νκ³ μμ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6721, 0.3897],
# [0.6721, 1.0000, 0.3740],
# [0.3897, 0.3740, 1.0000]])
```
## Training Details
### Training Data
- [ko-triplet-v1.0](nlpai-lab/ko-triplet-v1.0)
- Korean query-document-hard_negative data pair (open data)
- About 700000+ examples used totally
### Training Procedure
- **loss:** Used **[CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss)** by sentence-transformers
- **batch size:** 512
- **learning rate:** 1e-05
- **epochs:** 1
## Evaluation
### Metrics
- NDCG@1, F1@1, NDCG@3, F1@3
### Benchmark Datasets
- Ko-strategyQA
- AutoRAG-benchmark
- PublicHealthQA
## Results
<img src="KoE5-results.png" width=100%>
## FAQ
**1. Do I need to add the prefix "query: " and "passage: " to input texts?**
Yes, this is how the model is trained, otherwise you will see a performance degradation.
Here are some rules of thumb:
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.
- Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
- Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.
## Citation
If you find our paper or models helpful, please consider cite as follows:
```text
@misc{KoE5,
author = {NLP & AI Lab and Human-Inspired AI research},
title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
year = {2024},
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
journal = {GitHub repository},
howpublished = {\url{https://github.com/nlpai-lab/KoE5}},
}
```
```
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
```
## Limitations
Long texts will be truncated to at most 512 tokens. |