Official model repo of the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics"

### Model Summary

LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. Without requiring long texts in the training data, it generalizes surprisingly well to long-document retrieval.

- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed
- **Shared by:** Chenghao Xiao
- **Model type:** BERT-base
- **Language(s) (NLP):** English
- **Finetuned from model:** BERT-base-uncased

### Model Sources

- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
- **Paper:** https://aclanthology.org/2023.emnlp-main.86/

### Usage

Use the model with Sentence Transformers:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

text = "LASER-cubed is a dope model - It generalizes to long texts without needing the training sets to have long texts."
representation = model.encode(text)
```
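
For retrieval-style usage, here is a minimal sketch that ranks documents against a query by cosine similarity (the query and document strings are made up for illustration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")

# illustrative query and documents, not from the paper
query = "Why is document length a challenge for embedding models?"
docs = [
    "Length is a curse and a blessing: the paper studies length generalizability of embeddings.",
    "A recipe for sourdough bread with a long fermentation time.",
]

# encode and score with cosine similarity; higher = more similar
query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)
scores = util.cos_sim(query_emb, doc_embs)
print(scores)
```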

### Evaluation

Evaluate it with the BEIR framework:

```python
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# download the datasets with the original BEIR repo yourself first
data_path = './datasets/arguana'
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
```
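
For the download step mentioned in the comment above, one option is BEIR's own helper; the dataset URL pattern below is taken from the BEIR quickstart and may change, so treat it as an assumption:

```python
from beir import util

# fetch and unzip a BEIR dataset into ./datasets
dataset = "arguana"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "./datasets")  # e.g. './datasets/arguana'
```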

### Downstream Use

Information Retrieval

### Out-of-Scope Use

The model is not intended for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks based on similarity matching.

## Training Details

Max sequence length 256, batch size 256, learning rate 3e-05, 1 epoch, 10% warmup, trained on a single A100 GPU.
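
As a rough illustration only, the hyperparameters above could map onto a generic sentence-transformers contrastive run as sketched below. This uses a SimCSE-style dropout objective (MultipleNegativesRankingLoss over duplicated sentences) as a stand-in and is not the paper's LA(SER)-cubed objective; see the GitHub repo for the actual training code.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, losses, InputExample

# NOT the paper's objective -- a SimCSE-style stand-in showing where the
# reported hyperparameters (max seq 256, batch 256, lr 3e-05, 1 epoch,
# 10% warmup) would plug in.
word_emb = models.Transformer("bert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_emb.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_emb, pooling])

sentences = ["placeholder sentence one.", "placeholder sentence two."]  # replace with the wiki1M corpus
train_examples = [InputExample(texts=[s, s]) for s in sentences]  # dropout provides the augmentation
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=256)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=int(0.1 * len(train_dataloader)),  # 10% warmup
    optimizer_params={"lr": 3e-5},
)
```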

### Training Data

wiki1M (1 million sentences from English Wikipedia)

### Training Procedure

Please refer to the paper.

## Evaluation

### Results

Please refer to the paper for full retrieval results.

**BibTeX:**

```bibtex
@inproceedings{xiao2023length,
  title={Length is a Curse and a Blessing for Document-level Semantics},
  author={Xiao, Chenghao and Li, Yizhi and Hudson, G Thomas and Lin, Chenghua and Al Moubayed, Noura},
  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
  pages={1385--1396},
  year={2023}
}
```