@@ -13,21 +13,30 @@ pipeline_tag: sentence-similarity
 # KoE5
-**KoE5: A New Dataset and Model for Improving Korean Embedding Performance** - 장영준, 손준영, 박찬준, 이병구, 이태민, 임희석, HCLT 2024 Oral accepted
-This model is fine-tuned from multilingual-e5-large on [ko-triplet-v1.0](nlpai-lab/ko-triplet-v1.0).
-## Uses
-### Direct Usage (Sentence Transformers)
 First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
@@ -53,6 +62,33 @@ print(similarities)
 #        [0.3897, 0.3740, 1.0000]])
 ```
 ## FAQ
 **1. Do I need to add the prefix "query: " and "passage: " to input texts?**
@@ -69,7 +105,16 @@ Here are some rules of thumb:
 ## Citation
 If you find our paper or models helpful, please consider citing them as follows:
 ```
 @article{wang2024multilingual,
   title={Multilingual E5 Text Embeddings: A Technical Report},

 # KoE5
+Introducing KoE5, a model with advanced retrieval abilities.
+It has shown remarkable performance in Korean text retrieval, specifically outperforming most multilingual embedding models.
+To our knowledge, it is one of the best publicly available Korean retrieval models.
+For details, visit the [KoE5 repository](https://github.com/nlpai-lab/KoE5).
+### Model Description
+This is the model card of a 🤗 transformers model that has been pushed on the Hub.
+- **Developed by:** [NLP&AI Lab](http://nlp.korea.ac.kr/)
+- **Language(s) (NLP):** Korean, English
+- **License:** MIT
+- **Finetuned from model:** [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large)
+- **Finetuned dataset:** [ko-triplet-v1.0](nlpai-lab/ko-triplet-v1.0)
+## Example code
+### Install Dependencies
 First install the Sentence Transformers library:
 ```bash
 pip install -U sentence-transformers
 ```
+### Python code
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
 #        [0.3897, 0.3740, 1.0000]])
 ```
+## Training Details
+### Training Data
+- [ko-triplet-v1.0](nlpai-lab/ko-triplet-v1.0)
+- Korean (query, document, hard negative) triplets (open data)
+- More than 700,000 examples used in total
+### Training Procedure
+- **loss:** **[CachedMultipleNegativesRankingLoss](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#cachedmultiplenegativesrankingloss)** from sentence-transformers
+- **batch size:** 512
+- **learning rate:** 1e-05
+- **epochs:** 1
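
To make the training objective more concrete, here is a minimal pure-Python sketch of a multiple-negatives ranking loss over (query, document, hard negative) triplets. It is not the sentence-transformers implementation and it omits the gradient caching that the "Cached" variant adds to fit large batches in memory. The function names, the toy 2-d vectors, and the `scale` value are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy of each query against every candidate in the batch.

    For query i the correct candidate is positives[i]; all other positives and
    all hard negatives act as negatives. `scale` is an illustrative temperature.
    """
    candidates = positives + hard_negatives
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


# Toy embeddings (hypothetical): each query matches its own positive exactly.
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0], [0.0, -1.0]]

loss_aligned = multiple_negatives_ranking_loss(queries, positives, hard_negatives)
# Swapping the positives pairs each query with the wrong document.
loss_swapped = multiple_negatives_ranking_loss(queries, positives[::-1], hard_negatives)
```

With a batch size of 512, as listed above, each query is contrasted against far more candidates than in this two-example toy batch, which is the setting the cached variant is designed to make affordable.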
+## Evaluation
+### Metrics
+- NDCG@1, F1@1, NDCG@3, F1@3
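
As a rough illustration of the metrics listed above, the sketch below computes NDCG@k and F1@k for a single query, assuming binary relevance labels. The exact evaluation code used for KoE5 is not shown in this card, so the function names and the document IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """NDCG@k with binary relevance (gain 1 for a relevant document, else 0)."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    # Ideal ranking places all relevant documents at the top positions.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def f1_at_k(ranked_ids, relevant_ids, k):
    """Harmonic mean of precision@k and recall@k."""
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    if hits == 0:
        return 0.0
    precision = hits / k
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Hypothetical retrieval result: the only relevant document is ranked second.
ranked = ["d3", "d1", "d7"]
relevant = ["d1"]

scores = {
    "NDCG@1": ndcg_at_k(ranked, relevant, 1),
    "F1@1": f1_at_k(ranked, relevant, 1),
    "NDCG@3": ndcg_at_k(ranked, relevant, 3),
    "F1@3": f1_at_k(ranked, relevant, 3),
}
```

Benchmark scores are typically the mean of these per-query values over all queries in a dataset.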
+### Benchmark Datasets
+- Ko-strategyQA
+- AutoRAG-benchmark
+- PublicHealthQA
+## Results
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/65a4c4ed2548c41ad9b1421c/8toWmSrqH-aLKq1rSiqnv.png)
 ## FAQ
 **1. Do I need to add the prefix "query: " and "passage: " to input texts?**
 ## Citation
 If you find our paper or models helpful, please consider citing them as follows:
+```text
+@misc{KoE5,
+  author = {NLP & AI Lab and Human-Inspired AI research},
+  title = {KoE5: A New Dataset and Model for Improving Korean Embedding Performance},
+  year = {2024},
+  publisher = {Youngjoon Jang, Junyoung Son, Taemin Lee},
+  journal = {GitHub repository},
+  howpublished = {\url{https://github.com/nlpai-lab/KoE5}},
+}
+```
 ```
 @article{wang2024multilingual,
   title={Multilingual E5 Text Embeddings: A Technical Report},