---
library_name: transformers
license: mit
datasets:
  - nlpai-lab/ko-triplet-v1.0
language:
  - ko
  - en
base_model:
  - intfloat/multilingual-e5-large
pipeline_tag: sentence-similarity
---

# KoE5

*KoE5: A New Dataset and Model for Improving Korean Embedding Performance* — μž₯μ˜μ€€, μ†μ€€μ˜, λ°•μ°¬μ€€, 이병ꡬ, μ΄νƒœλ―Ό, μž„ν¬μ„. Accepted as an oral presentation at HCLT 2024.

This model is fine-tuned from [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large) on the [nlpai-lab/ko-triplet-v1.0](https://huggingface.co/datasets/nlpai-lab/ko-triplet-v1.0) dataset.

## Uses

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference.

```python
from sentence_transformers import SentenceTransformer

# Download from the πŸ€— Hub
model = SentenceTransformer("nlpai-lab/KoE5")

# Run inference
sentences = [
    'query: ν—Œλ²•κ³Ό 법원쑰직법은 μ–΄λ–€ 방식을 톡해 기본ꢌ 보μž₯ λ“±μ˜ λ‹€μ–‘ν•œ 법적 λͺ¨μƒ‰μ„ κ°€λŠ₯ν•˜κ²Œ ν–ˆμ–΄',
    'passage: 4. μ‹œμ‚¬μ κ³Ό κ°œμ„ λ°©ν–₯ μ•žμ„œ μ‚΄νŽ΄λ³Έ 바와 같이 우리 ν—Œλ²•κ³Ό r법원쑰직 법」은 λŒ€λ²•μ› ꡬ성을 λ‹€μ–‘ν™”ν•˜μ—¬ 기본ꢌ 보μž₯κ³Ό 민주주의 확립에 μžˆμ–΄ 닀각적인 법적 λͺ¨μƒ‰μ„ κ°€λŠ₯ν•˜κ²Œ ν•˜λŠ” 것을 κ·Όλ³Έ κ·œλ²”μœΌλ‘œ ν•˜κ³  μžˆλ‹€. λ”μš±μ΄ ν•©μ˜μ²΄λ‘œμ„œμ˜ λŒ€λ²•μ› 원리λ₯Ό μ±„νƒν•˜κ³  μžˆλŠ” 것 μ—­μ‹œ κ·Έ κ΅¬μ„±μ˜ 닀양성을 μš”μ²­ν•˜λŠ” κ²ƒμœΌλ‘œ ν•΄μ„λœλ‹€. 이와 같은 κ΄€μ μ—μ„œ λ³Ό λ•Œ ν˜„μ§ 법원μž₯κΈ‰ κ³ μœ„λ²•κ΄€μ„ μ€‘μ‹¬μœΌλ‘œ λŒ€λ²•μ›μ„ κ΅¬μ„±ν•˜λŠ” 관행은 κ°œμ„ ν•  ν•„μš”κ°€ μžˆλŠ” κ²ƒμœΌλ‘œ 보인닀.',
    'passage: β–‘ μ—°λ°©ν—Œλ²•μž¬νŒμ†ŒλŠ” 2001λ…„ 1μ›” 24일 5:3의 λ‹€μˆ˜κ²¬ν•΄λ‘œ γ€Œλ²•μ›μ‘°μ§λ²•γ€ 제169μ‘° 제2문이 ν—Œλ²•μ— ν•©μΉ˜λœλ‹€λŠ” νŒκ²°μ„ λ‚΄λ ΈμŒ β—‹ 5인의 λ‹€μˆ˜ μž¬νŒκ΄€μ€ μ†Œμ†‘κ΄€κ³„μΈμ˜ 인격ꢌ 보호, κ³΅μ •ν•œ 절차의 보μž₯κ³Ό 방해받지 μ•ŠλŠ” 법과 진싀 발견 등을 근거둜 ν•˜μ—¬ ν…”λ ˆλΉ„μ „ μ΄¬μ˜μ— λŒ€ν•œ μ ˆλŒ€μ μΈ κΈˆμ§€λ₯Ό ν—Œλ²•μ— ν•©μΉ˜ν•˜λŠ” κ²ƒμœΌλ‘œ λ³΄μ•˜μŒ β—‹ κ·ΈλŸ¬λ‚˜ λ‚˜λ¨Έμ§€ 3인의 μž¬νŒκ΄€μ€ ν–‰μ •λ²•μ›μ˜ μ†Œμ†‘μ ˆμ°¨λŠ” νŠΉλ³„ν•œ 인격ꢌ 보호의 이읡도 μ—†μœΌλ©°, ν…”λ ˆλΉ„μ „ 곡개주의둜 인해 법과 진싀 발견의 과정이 μ–Έμ œλ‚˜ μœ„νƒœλ‘­κ²Œ λ˜λŠ” 것은 μ•„λ‹ˆλΌλ©΄μ„œ λ°˜λŒ€μ˜κ²¬μ„ μ œμ‹œν•¨ β—‹ μ™œλƒν•˜λ©΄ ν–‰μ •λ²•μ›μ˜ μ†Œμ†‘μ ˆμ°¨μ—μ„œλŠ” μ†Œμ†‘λ‹Ήμ‚¬μžκ°€ 개인적으둜 직접 심리에 μ°Έμ„ν•˜κΈ°λ³΄λ‹€λŠ” λ³€ν˜Έμ‚¬κ°€ μ°Έμ„ν•˜λŠ” κ²½μš°κ°€ 많으며, μ‹¬λ¦¬λŒ€μƒλ„ μ‚¬μ‹€λ¬Έμ œκ°€ μ•„λ‹Œ 법λ₯ λ¬Έμ œκ°€ λŒ€λΆ€λΆ„μ΄κΈ° λ•Œλ¬Έμ΄λΌλŠ” κ²ƒμž„ β–‘ ν•œνŽΈ, μ—°λ°©ν—Œλ²•μž¬νŒμ†ŒλŠ” γ€Œμ—°λ°©ν—Œλ²•μž¬νŒμ†Œλ²•γ€(Bundesverfassungsgerichtsgesetz: BVerfGG) 제17a쑰에 따라 μ œν•œμ μ΄λ‚˜λ§ˆ μž¬νŒμ— λŒ€ν•œ 방솑을 ν—ˆμš©ν•˜κ³  있음 β—‹ γ€Œμ—°λ°©ν—Œλ²•μž¬νŒμ†Œλ²•γ€ 제17μ‘°μ—μ„œ γ€Œλ²•μ›μ‘°μ§λ²•γ€ 제14절 내지 제16절의 κ·œμ •μ„ μ€€μš©ν•˜λ„λ‘ ν•˜κ³  μžˆμ§€λ§Œ, λ…ΉμŒμ΄λ‚˜ μ΄¬μ˜μ„ ν†΅ν•œ μž¬νŒκ³΅κ°œμ™€ κ΄€λ ¨ν•˜μ—¬μ„œλŠ” γ€Œλ²•μ›μ‘°μ§λ²•γ€κ³Ό λ‹€λ₯Έ λ‚΄μš©μ„ κ·œμ •ν•˜κ³  있음',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.6721, 0.3897],
#         [0.6721, 1.0000, 0.3740],
#         [0.3897, 0.3740, 1.0000]])
```
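
If you prefer plain πŸ€— Transformers instead of Sentence Transformers, the sketch below follows the usual E5 recipe: average-pool the last hidden states over the attention mask, then L2-normalize. This assumes KoE5 keeps the pooling of its base model, intfloat/multilingual-e5-large; the short Korean inputs are illustrative only.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def average_pool(last_hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Zero out padding positions, then average the remaining token embeddings
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

tokenizer = AutoTokenizer.from_pretrained("nlpai-lab/KoE5")
model = AutoModel.from_pretrained("nlpai-lab/KoE5")

# Illustrative inputs; keep the "query: " / "passage: " prefixes
texts = [
    "query: λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?",
    "passage: λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμ΄λ‹€.",
]
batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

embeddings = average_pool(outputs.last_hidden_state, batch["attention_mask"])
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarities (embeddings are unit-normalized)
print(embeddings @ embeddings.T)
```

As in the Sentence Transformers path, inputs longer than 512 tokens are truncated before encoding.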

## FAQ

**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes, this is how the model was trained; otherwise you will see a performance degradation.

Here are some rules of thumb (see the sketch after this list):

- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
- Use the "query: " prefix if you want to use embeddings as features, e.g. for linear-probing classification or clustering.

## Citation

If you find our paper or models helpful, please consider citing them as follows:

```bibtex
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

## Limitations

Long texts will be truncated to at most 512 tokens.
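
If you are unsure whether an input will be truncated, here is a minimal sketch for checking ahead of time, assuming the Sentence Transformers wrapper (which exposes the underlying tokenizer and maximum sequence length); `long_text` is a placeholder:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nlpai-lab/KoE5")
print(model.max_seq_length)  # 512

long_text = "passage: " + "맀우 κΈ΄ λ¬Έμ„œ λ‚΄μš© " * 500  # placeholder long document
n_tokens = len(model.tokenizer(long_text)["input_ids"])
if n_tokens > model.max_seq_length:
    print(f"{n_tokens} tokens: everything after token {model.max_seq_length} is ignored.")
```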