# Korean-Sentence-Embedding
Korean sentence embedding repository. You can download the pre-trained models and run inference right away, and the repository also provides environments where individuals can train their own models.

## Quick tour
> **Note** <br>
> All the pre-trained models are uploaded to the Huggingface Model Hub. Check https://huggingface.co/BM-K
```python
import torch
from transformers import AutoModel, AutoTokenizer

def cal_score(a, b):
    # Cosine similarity between every pair of rows in a and b, scaled to 0-100
    if len(a.shape) == 1: a = a.unsqueeze(0)
    if len(b.shape) == 1: b = b.unsqueeze(0)

    a_norm = a / a.norm(dim=1)[:, None]
    b_norm = b / b.norm(dim=1)[:, None]
    return torch.mm(a_norm, b_norm.transpose(0, 1)) * 100

model = AutoModel.from_pretrained('BM-K/KoSimCSE-roberta-multitask')  # or 'BM-K/KoSimCSE-bert-multitask'
tokenizer = AutoTokenizer.from_pretrained('BM-K/KoSimCSE-roberta-multitask')  # or 'BM-K/KoSimCSE-bert-multitask'

sentences = ['치타가 들판을 가로 질러 먹이를 쫓는다.',
             '치타 한 마리가 먹이 뒤에서 달리고 있다.',
             '원숭이 한 마리가 드럼을 연주한다.']

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
embeddings, _ = model(**inputs, return_dict=False)

# Sentence embeddings are taken from the [CLS] token (position 0) of each sequence
score01 = cal_score(embeddings[0][0], embeddings[1][0])  # 84.09
# 'A cheetah chases its prey across a field.' @ 'A cheetah is running behind its prey.'
score02 = cal_score(embeddings[0][0], embeddings[2][0])  # 23.21
# 'A cheetah chases its prey across a field.' @ 'A monkey is playing the drums.'
```

## Update history

**Updates on Mar.08.2023**
- Update Unsupervised Models

**Updates on Feb.24.2023**
- Upload KoSimCSE clustering example

**Updates on Nov.15.2022**
- Upload KoDiffCSE-unsupervised training code

**Updates on Oct.27.2022**
- Upload KoDiffCSE-unsupervised performance

**Updates on Oct.21.2022**
- Upload KoSimCSE-unsupervised performance

**Updates on Jun.01.2022**
- Release KoSimCSE-multitask models

**Updates on May.23.2022**
- Upload KoSentenceT5 training code
- Upload KoSentenceT5 performance

**Updates on Mar.01.2022**
- Release KoSimCSE

**Updates on Feb.11.2022**
- Upload KoSimCSE training code
- Upload KoSimCSE performance

**Updates on Jan.26.2022**
- Upload KoSBERT training code
- Upload KoSBERT performance

## Baseline Models
Baseline models used for Korean sentence embedding - [KLUE-PLMs](https://github.com/KLUE-benchmark/KLUE/blob/main/README.md)

| Model | Embedding size | Hidden size | # Layers | # Heads |
|----------------------|----------------|-------------|----------|---------|
| KLUE-BERT-base | 768 | 768 | 12 | 12 |
| KLUE-RoBERTa-base | 768 | 768 | 12 | 12 |

> **Warning** <br>
> Large pre-trained models need a lot of GPU memory to train.
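
As a quick sanity check, the numbers in the table can be read off each checkpoint's configuration. A minimal sketch, assuming the public `klue/bert-base` and `klue/roberta-base` checkpoints on the Huggingface Hub:

```python
from transformers import AutoConfig

# Print hidden size, layer count, and head count for each KLUE baseline
for name in ('klue/bert-base', 'klue/roberta-base'):
    cfg = AutoConfig.from_pretrained(name)
    print(name, cfg.hidden_size, cfg.num_hidden_layers, cfg.num_attention_heads)
```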

## Available Models
1. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks [[SBERT]-[EMNLP 2019]](https://arxiv.org/abs/1908.10084)
2. SimCSE: Simple Contrastive Learning of Sentence Embeddings [[SimCSE]-[EMNLP 2021]](https://arxiv.org/abs/2104.08821)
3. Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models [[Sentence-T5]-[ACL findings 2022]](https://arxiv.org/abs/2108.08877)
4. DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings [[DiffCSE]-[NAACL 2022]](https://arxiv.org/abs/2204.10298)

## Datasets
- [kakaobrain KorNLU Datasets](https://github.com/kakaobrain/KorNLUDatasets) (Supervised setting)
- [wiki-corpus](https://github.com/jeongukjae/korean-wikipedia-corpus) (Unsupervised setting)
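
Both datasets are plain text files. A reading sketch for the KorSTS splits, assuming the standard STS-B-style tab-separated layout with `score`, `sentence1`, and `sentence2` columns (check your copy's header if it differs):

```python
import csv

# KorSTS-style tsv: tab-separated with a header row; quoting is disabled
# because sentences may contain stray quote characters
with open('sts-train.tsv', encoding='utf-8') as f:
    reader = csv.DictReader(f, delimiter='\t', quoting=csv.QUOTE_NONE)
    pairs = [(r['sentence1'], r['sentence2'], float(r['score'])) for r in reader]
```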

## Setups
[![Python](https://img.shields.io/badge/python-3.8.5-blue?logo=python&logoColor=FED643)](https://www.python.org/downloads/release/python-385/)
[![Pytorch](https://img.shields.io/badge/pytorch-1.7.1-red?logo=pytorch)](https://pytorch.org/get-started/previous-versions/)

### KoSentenceBERT
- 🤗 [Model Training](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSBERT)
- Dataset (Supervised)
  - Training: snli_1.0_train.ko.tsv, sts-train.tsv (multi-task)
    - Performance can be further improved by adding multinli data to training.
  - Validation: sts-dev.tsv
  - Test: sts-test.tsv

### KoSimCSE
- 🤗 [Model Training](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSimCSE)
- Dataset (Supervised)
  - Training: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
  - Validation: sts-dev.tsv
  - Test: sts-test.tsv
- Dataset (Unsupervised)
  - Training: wiki_corpus.txt
  - Validation: sts-dev.tsv
  - Test: sts-test.tsv
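
For intuition about what the training code optimizes: SimCSE trains with an InfoNCE objective in which each sentence's positive is its paired (or dropout-augmented) embedding and the other in-batch sentences serve as negatives. A schematic sketch of that loss, not the repository's exact training code:

```python
import torch
import torch.nn.functional as F

def simcse_loss(h1, h2, temperature=0.05):
    # h1, h2: [batch, dim] embeddings of the two views of each sentence
    sim = F.cosine_similarity(h1.unsqueeze(1), h2.unsqueeze(0), dim=-1) / temperature
    labels = torch.arange(sim.size(0), device=sim.device)  # positives lie on the diagonal
    return F.cross_entropy(sim, labels)
```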

### KoSentenceT5
- 🤗 [Model Training](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSentenceT5)
- Dataset (Supervised)
  - Training: snli_1.0_train.ko.tsv + multinli.train.ko.tsv
  - Validation: sts-dev.tsv
  - Test: sts-test.tsv

### KoDiffCSE
- 🤗 [Model Training](https://github.com/BM-K/KoDiffCSE)
- Dataset (Unsupervised)
  - Training: wiki_corpus.txt
  - Validation: sts-dev.tsv
  - Test: sts-test.tsv

## Performance-supervised

| Model | Average | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| KoSBERT<sup>†</sup><sub>SKT</sub> | 77.40 | 78.81 | 78.47 | 77.68 | 77.78 | 77.71 | 77.83 | 75.75 | 75.22 |
| KoSBERT | 80.39 | 82.13 | 82.25 | 80.67 | 80.75 | 80.69 | 80.78 | 77.96 | 77.90 |
| KoSRoBERTa | 81.64 | 81.20 | 82.20 | 81.79 | 82.34 | 81.59 | 82.20 | 80.62 | 81.25 |
| | | | | | | | | | |
| KoSentenceBART | 77.14 | 79.71 | 78.74 | 78.42 | 78.02 | 78.40 | 78.00 | 74.24 | 72.15 |
| KoSentenceT5 | 77.83 | 80.87 | 79.74 | 80.24 | 79.36 | 80.19 | 79.27 | 72.81 | 70.17 |
| | | | | | | | | | |
| KoSimCSE-BERT<sup>†</sup><sub>SKT</sub> | 81.32 | 82.12 | 82.56 | 81.84 | 81.63 | 81.99 | 81.74 | 79.55 | 79.19 |
| KoSimCSE-BERT | 83.37 | 83.22 | 83.58 | 83.24 | 83.60 | 83.15 | 83.54 | 83.13 | 83.49 |
| KoSimCSE-RoBERTa | 83.65 | 83.60 | 83.77 | 83.54 | 83.76 | 83.55 | 83.77 | 83.55 | 83.64 |
| | | | | | | | | | |
| KoSimCSE-BERT-multitask | 85.71 | 85.29 | 86.02 | 85.63 | 86.01 | 85.57 | 85.97 | 85.26 | 85.93 |
| KoSimCSE-RoBERTa-multitask | 85.77 | 85.08 | 86.12 | 85.84 | 86.12 | 85.83 | 86.12 | 85.03 | 85.99 |

- [KoSBERT<sup>†</sup><sub>SKT</sub>](https://github.com/BM-K/KoSentenceBERT-SKT)
- [KoSimCSE-BERT<sup>†</sup><sub>SKT</sub>](https://github.com/BM-K/KoSimCSE-SKT)
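
Each table entry is a correlation coefficient ×100 between the model's similarity scores and the gold KorSTS labels, computed under the similarity measure named in the column. A minimal sketch of how one such cell is computed, with hypothetical `sims`/`gold` lists (`scipy` is an extra dependency here):

```python
from scipy import stats

sims = [0.82, 0.11, 0.57]   # hypothetical model cosine similarities
gold = [4.8, 0.4, 3.1]      # hypothetical gold STS labels (0-5 scale)

spearman = stats.spearmanr(sims, gold).correlation
pearson = stats.pearsonr(sims, gold)[0]
print(f"Spearman x100: {spearman * 100:.2f}, Pearson x100: {pearson * 100:.2f}")
```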

## Performance-unsupervised

| Model | Average | Cosine Pearson | Cosine Spearman | Euclidean Pearson | Euclidean Spearman | Manhattan Pearson | Manhattan Spearman | Dot Pearson | Dot Spearman |
|------------------------|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|:----:|
| KoSRoBERTa-base<sup>†</sup> | N/A | N/A | 48.96 | N/A | N/A | N/A | N/A | N/A | N/A |
| KoSRoBERTa-large<sup>†</sup> | N/A | N/A | 51.35 | N/A | N/A | N/A | N/A | N/A | N/A |
| | | | | | | | | | |
| KoSimCSE-BERT | 74.08 | 74.92 | 73.98 | 74.15 | 74.22 | 74.07 | 74.07 | 74.15 | 73.14 |
| KoSimCSE-RoBERTa | 75.27 | 75.93 | 75.00 | 75.28 | 75.01 | 75.17 | 74.83 | 75.95 | 75.01 |
| | | | | | | | | | |
| KoDiffCSE-RoBERTa | 77.17 | 77.73 | 76.96 | 77.21 | 76.89 | 77.11 | 76.81 | 77.74 | 76.97 |

- [Korean-SRoBERTa<sup>†</sup>](https://arxiv.org/abs/2004.03289)

## Downstream tasks
- KoSBERT: [Semantic Search](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSBERT#semantic-search), [Clustering](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSBERT#clustering)
- KoSimCSE: [Semantic Search](https://github.com/BM-K/Sentence-Embedding-is-all-you-need/tree/main/KoSimCSE#semantic-search), [Clustering](https://github.com/BM-K/Sentence-Embedding-Is-All-You-Need/tree/main/KoSimCSE#clustering)
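
Semantic search with these models reduces to scoring one query embedding against pre-computed corpus embeddings. A minimal sketch reusing `model`, `tokenizer`, and `cal_score` from the Quick tour (`corpus` and `query` are placeholder variables):

```python
import torch

def embed(texts):
    # [CLS]-token embeddings for a batch of sentences
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        embeddings, _ = model(**inputs, return_dict=False)
    return embeddings[:, 0]

corpus_emb = embed(corpus)                 # corpus: list of candidate sentences
scores = cal_score(embed([query]), corpus_emb).squeeze(0)
best = scores.topk(k=min(3, len(corpus)))  # top-3 most similar sentences
```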

## License
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.

<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />

## References

```bibtex
@misc{park2021klue,
  title={KLUE: Korean Language Understanding Evaluation},
  author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jung-Woo Ha and Kyunghyun Cho},
  year={2021},
  eprint={2105.09680},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}

@inproceedings{gao2021simcse,
  title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
  author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
  booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
  year={2021}
}

@article{ham2020kornli,
  title={KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding},
  author={Ham, Jiyeon and Choe, Yo Joong and Park, Kyubyong and Choi, Ilji and Soh, Hyungjoon},
  journal={arXiv preprint arXiv:2004.03289},
  year={2020}
}

@inproceedings{reimers-2019-sentence-bert,
  title={Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks},
  author={Reimers, Nils and Gurevych, Iryna},
  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},
  month={11},
  year={2019},
  publisher={Association for Computational Linguistics},
  url={http://arxiv.org/abs/1908.10084}
}

@inproceedings{chuang2022diffcse,
  title={{DiffCSE}: Difference-based Contrastive Learning for Sentence Embeddings},
  author={Chuang, Yung-Sung and Dangovski, Rumen and Luo, Hongyin and Zhang, Yang and Chang, Shiyu and Soljacic, Marin and Li, Shang-Wen and Yih, Wen-tau and Kim, Yoon and Glass, James},
  booktitle={Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  year={2022}
}
```