library_name: transformers
license: mit
datasets:
- nlpai-lab/ko-triplet-v1.0
language:
- ko
- en
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: sentence-similarity
KoE5
Introducing KoE5, a model with advanced retrieval abilities.
It has shown remarkable performance in Korean text retrieval, speficially overwhelming most multilingual embedding models.
To our knowledge, It is one of the best publicly opened Korean retrieval models.
For details, visit the KoE5 repository
Model Description
This is the model card of a π€ transformers model that has been pushed on the Hub.
- Developed by: NLP&AI Lab
- Language(s) (NLP): Korean, English
- License: MIT
- Finetuned from model: intfloat/multilingual-e5-large
- Finetuned dataset: ko-triplet-v1.0
Example code
Install Dependencies
First install the Sentence Transformers library:
pip install -U sentence-transformers
Python code
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the π€ Hub
model = SentenceTransformer("nlpai-lab/KoE5")
# Run inference
sentences = [
'query: νλ²κ³Ό λ²μμ‘°μ§λ²μ μ΄λ€ λ°©μμ ν΅ν΄ κΈ°λ³ΈκΆ λ³΄μ₯ λ±μ λ€μν λ²μ λͺ¨μμ κ°λ₯νκ² νμ΄',
'passage: 4. μμ¬μ κ³Ό κ°μ λ°©ν₯ μμ μ΄ν΄λ³Έ λ°μ κ°μ΄ μ°λ¦¬ νλ²κ³Ό ο½’λ²μμ‘°μ§ λ²ο½£μ λλ²μ ꡬμ±μ λ€μννμ¬ κΈ°λ³ΈκΆ λ³΄μ₯κ³Ό λ―Όμ£Όμ£Όμ ν립μ μμ΄ λ€κ°μ μΈ λ²μ λͺ¨μμ κ°λ₯νκ² νλ κ²μ κ·Όλ³Έ κ·λ²μΌλ‘ νκ³ μλ€. λμ±μ΄ ν©μ체λ‘μμ λλ²μ μ리λ₯Ό μ±ννκ³ μλ κ² μμ κ·Έ ꡬμ±μ λ€μμ±μ μμ²νλ κ²μΌλ‘ ν΄μλλ€. μ΄μ κ°μ κ΄μ μμ λ³Ό λ νμ§ λ²μμ₯κΈ κ³ μλ²κ΄μ μ€μ¬μΌλ‘ λλ²μμ ꡬμ±νλ κ΄νμ κ°μ ν νμκ° μλ κ²μΌλ‘ 보μΈλ€.',
'passage: β‘ μ°λ°©νλ²μ¬νμλ 2001λ
1μ 24μΌ 5:3μ λ€μ견ν΄λ‘ γλ²μμ‘°μ§λ²γ μ 169μ‘° μ 2λ¬Έμ΄ νλ²μ ν©μΉλλ€λ νκ²°μ λ΄λ Έμ β 5μΈμ λ€μ μ¬νκ΄μ μμ‘κ΄κ³μΈμ μΈκ²©κΆ 보νΈ, 곡μ ν μ μ°¨μ 보μ₯κ³Ό λ°©ν΄λ°μ§ μλ λ²κ³Ό μ§μ€ λ°κ²¬ λ±μ κ·Όκ±°λ‘ νμ¬ ν
λ λΉμ 촬μμ λν μ λμ μΈ κΈμ§λ₯Ό νλ²μ ν©μΉνλ κ²μΌλ‘ 보μμ β κ·Έλ¬λ λλ¨Έμ§ 3μΈμ μ¬νκ΄μ νμ λ²μμ μμ‘μ μ°¨λ νΉλ³ν μΈκ²©κΆ 보νΈμ μ΄μ΅λ μμΌλ©°, ν
λ λΉμ 곡κ°μ£Όμλ‘ μΈν΄ λ²κ³Ό μ§μ€ λ°κ²¬μ κ³Όμ μ΄ μΈμ λ μνλ‘κ² λλ κ²μ μλλΌλ©΄μ λ°λμ견μ μ μν¨ β μλνλ©΄ νμ λ²μμ μμ‘μ μ°¨μμλ μμ‘λΉμ¬μκ° κ°μΈμ μΌλ‘ μ§μ μ¬λ¦¬μ μ°Έμν기보λ€λ λ³νΈμ¬κ° μ°Έμνλ κ²½μ°κ° λ§μΌλ©°, μ¬λ¦¬λμλ μ¬μ€λ¬Έμ κ° μλ λ²λ₯ λ¬Έμ κ° λλΆλΆμ΄κΈ° λλ¬Έμ΄λΌλ κ²μ β‘ ννΈ, μ°λ°©νλ²μ¬νμλ γμ°λ°©νλ²μ¬νμλ²γ(Bundesverfassungsgerichtsgesetz: BVerfGG) μ 17aμ‘°μ λ°λΌ μ νμ μ΄λλ§ μ¬νμ λν λ°©μ‘μ νμ©νκ³ μμ β γμ°λ°©νλ²μ¬νμλ²γ μ 17μ‘°μμ γλ²μμ‘°μ§λ²γ μ 14μ λ΄μ§ μ 16μ μ κ·μ μ μ€μ©νλλ‘ νκ³ μμ§λ§, λ
Ήμμ΄λ 촬μμ ν΅ν μ¬ν곡κ°μ κ΄λ ¨νμ¬μλ γλ²μμ‘°μ§λ²γκ³Ό λ€λ₯Έ λ΄μ©μ κ·μ νκ³ μμ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6721, 0.3897],
# [0.6721, 1.0000, 0.3740],
# [0.3897, 0.3740, 1.0000]])
Training Details
Training Data
- ko-triplet-v1.0
- Korean query-document-hard_negative data pair (open data)
- About 700000+ examples used totally
Training Procedure
- loss: Used CachedMultipleNegativesRankingLoss by sentence-transformers
- batch size: 512
- learning rate: 1e-05
- epochs: 1
Evaluation
Metrics
- NDCG@1, F1@1, NDCG@3, F1@3
Benchmark Datasets
- Ko-strategyQA
- AutoRAG-benchmark
- PublicHealthQA
Results
FAQ
1. Do I need to add the prefix "query: " and "passage: " to input texts?
Yes, this is how the model is trained, otherwise you will see a performance degradation.
Here are some rules of thumb:
Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA, ad-hoc information retrieval.
Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.
Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.
Citation
If you find our paper or models helpful, please consider cite as follows:
@misc{KoE5,
author = {NLP & AI Lab and Human-Inspired AI research},
title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
year = {2024},
publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
journal = {GitHub repository},
howpublished = {\url{https://github.com/nlpai-lab/KoE5}},
}
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
Limitations
Long texts will be truncated to at most 512 tokens.