library_name: transformers
license: mit
datasets:
- nlpai-lab/ko-triplet-v1.0
language:
- ko
- en
base_model:
- intfloat/multilingual-e5-large
pipeline_tag: sentence-similarity
KoE5
KoE5: A New Dataset and Model for Improving Korean Embedding Performance (KoE5: 한국어 임베딩 성능 향상을 위한 새로운 데이터셋 및 모델) - 장영준, 손준영, 박찬준, 이병구, 이태민, 임희석, accepted as an oral presentation at HCLT 2024
This model is a fine-tuned version of multilingual-e5-large, trained on the ko-triplet-v1.0 dataset.
Uses
Direct Usage (Sentence Transformers)
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("nlpai-lab/KoE5")
# Run inference
sentences = [
    'query: 헌법과 법원조직법은 어떤 방식을 통해 기본권 보장 등의 다양한 법적 모색을 가능하게 했어',
    'passage: 4. 시사점과 개선방향 앞서 살펴본 바와 같이 우리 헌법과 ｢법원조직 법｣은 대법원 구성을 다양화하여 기본권 보장과 민주주의 확립에 있어 다각적인 법적 모색을 가능하게 하는 것을 근본 규범으로 하고 있다. 더욱이 합의체로서의 대법원 원리를 채택하고 있는 것 역시 그 구성의 다양성을 요청하는 것으로 해석된다. 이와 같은 관점에서 볼 때 현직 법원장급 고위법관을 중심으로 대법원을 구성하는 관행은 개선할 필요가 있는 것으로 보인다.',
    'passage: □ 연방헌법재판소는 2001년 1월 24일 5:3의 다수견해로 「법원조직법」 제169조 제2문이 헌법에 합치된다는 판결을 내렸음 ○ 5인의 다수 재판관은 소송관계인의 인격권 보호, 공정한 절차의 보장과 방해받지 않는 법과 진실 발견 등을 근거로 하여 텔레비전 촬영에 대한 절대적인 금지를 헌법에 합치하는 것으로 보았음 ○ 그러나 나머지 3인의 재판관은 행정법원의 소송절차는 특별한 인격권 보호의 이익도 없으며, 텔레비전 공개주의로 인해 법과 진실 발견의 과정이 언제나 위태롭게 되는 것은 아니라면서 반대의견을 제시함 ○ 왜냐하면 행정법원의 소송절차에서는 소송당사자가 개인적으로 직접 심리에 참석하기보다는 변호사가 참석하는 경우가 많으며, 심리대상도 사실문제가 아닌 법률문제가 대부분이기 때문이라는 것임 □ 한편, 연방헌법재판소는 「연방헌법재판소법」(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a조에 따라 제한적이나마 재판에 대한 방송을 허용하고 있음 ○ 「연방헌법재판소법」 제17조에서 「법원조직법」 제14절 내지 제16절의 규정을 준용하도록 하고 있지만, 녹음이나 촬영을 통한 재판공개와 관련하여서는 「법원조직법」과 다른 내용을 규정하고 있음',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6721, 0.3897],
# [0.6721, 1.0000, 0.3740],
# [0.3897, 0.3740, 1.0000]])
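In the matrix above, row 0 belongs to the query: it scores 0.6721 against the first passage and 0.3897 against the second, so the first passage ranks higher. A minimal sketch of turning such scores into a ranking follows. It assumes cosine similarity (which, as far as we know, is the default metric of `model.similarity`), uses toy 3-dimensional vectors in place of the real 1024-dimensional embeddings, and the helper names `cosine` and `rank_passages` are hypothetical, not part of Sentence Transformers.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_passages(query_vec, passage_vecs):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine(query_vec, p)) for i, p in enumerate(passage_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for model.encode(...) output (illustrative only).
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_passages(query, passages)
print(ranking[0][0])  # index of the best-matching passage
```

With real embeddings, `query_vec` would be the encoded "query: " text and `passage_vecs` the encoded "passage: " texts.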
FAQ
1. Do I need to add the prefixes "query: " and "passage: " to input texts?
Yes. The model was trained with these prefixes, so omitting them will degrade performance.
Here are some rules of thumb:
Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
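The rules of thumb above can be sketched as a small helper that prepends the right prefix before encoding. Only the literal strings "query: " and "passage: " come from this card; the function names `add_prefix`, `prepare_retrieval_inputs`, and `prepare_symmetric_inputs` are hypothetical and are not part of the model or of Sentence Transformers.

```python
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def add_prefix(texts, prefix):
    """Prepend `prefix` to each text unless it is already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def prepare_retrieval_inputs(queries, passages):
    """Asymmetric tasks: queries and passages get different prefixes."""
    return add_prefix(queries, QUERY_PREFIX), add_prefix(passages, PASSAGE_PREFIX)


def prepare_symmetric_inputs(texts):
    """Symmetric tasks and feature extraction: every text gets 'query: '."""
    return add_prefix(texts, QUERY_PREFIX)


queries, passages = prepare_retrieval_inputs(
    ["what is KoE5"],
    ["KoE5 is a Korean embedding model."],
)
print(queries)
print(passages)
```

The returned lists can be passed directly to `model.encode(...)` as in the usage example.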
Citation
If you find our paper or models helpful, please consider citing as follows:
@article{wang2024multilingual,
title={Multilingual E5 Text Embeddings: A Technical Report},
author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
journal={arXiv preprint arXiv:2402.05672},
year={2024}
}
Limitations
Long texts will be truncated to at most 512 tokens.
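One common way to work around truncation is to split a long document into overlapping chunks and embed each chunk separately. The sketch below is an illustration under stated assumptions, not the authors' method: `chunk_by_budget` is a hypothetical helper, and it operates on any list of units. To stay dependency-free the usage splits on whitespace, but words are not tokens, so in practice the units should be token IDs from the model's tokenizer, and the budget should leave room for the prefix and special tokens.

```python
def chunk_by_budget(units, max_units=512, overlap=0):
    """Split `units` into consecutive chunks of at most `max_units` items.

    Neighbouring chunks share `overlap` items so that text cut at a
    boundary still appears with some context in the next chunk.
    """
    if max_units <= 0:
        raise ValueError("max_units must be positive")
    if not 0 <= overlap < max_units:
        raise ValueError("overlap must satisfy 0 <= overlap < max_units")

    step = max_units - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + max_units])
        if start + max_units >= len(units):
            break  # the last chunk already reaches the end of the input
    return chunks


# Whitespace words as a rough stand-in for tokens (assumption, see above).
text = "one two three four five six seven eight nine ten"
word_chunks = chunk_by_budget(text.split(), max_units=4, overlap=1)
passages = ["passage: " + " ".join(chunk) for chunk in word_chunks]
print(passages)
```

Each resulting string can then be encoded on its own, and the per-chunk scores combined (for example by taking the maximum) when ranking documents.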