gowitheflow
/

LASER-cubed-bert-base-unsup

gowitheflow committed on Dec 26, 2023

Commit

acf4948

·

1 Parent(s): 71c1b6b

Create README.md

Files changed (1)

README.md +82 -0

README.md ADDED

	@@ -0,0 +1,82 @@

+Official model repository for the EMNLP 2023 paper "Length is a Curse and a Blessing for Document-level Semantics".
+### Model Summary
+LASER-cubed-bert-base-unsup is an **unsupervised** model trained on the wiki1M dataset. It does not need long texts in its training data, yet it generalizes surprisingly well to long-document retrieval.
+- **Developed by:** Chenghao Xiao, Yizhi Li, G Thomas Hudson, Chenghua Lin, Noura Al-Moubayed
+- **Shared by:** Chenghao Xiao
+- **Model type:** BERT-base
+- **Language(s) (NLP):** English
+- **Finetuned from model:** BERT-base-uncased
+### Model Sources
+- **GitHub Repo:** https://github.com/gowitheflow-1998/LA-SER-cubed
+- **Paper:** https://aclanthology.org/2023.emnlp-main.86/
+### Usage
+Use the model with Sentence Transformers:
+```python
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("gowitheflow/LASER-cubed-bert-base-unsup")
+text = "LASER-cubed is a dope model - It generalizes to long texts without needing the training sets to have long texts."
+representation = model.encode(text)
+```
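An embedding returned by `model.encode` is usually compared with another embedding by cosine similarity. The following is a minimal sketch of that comparison and is not part of the original README. The function name `cosine_similarity` and the toy vectors are hypothetical, chosen for illustration; the vectors stand in for real model outputs.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real embeddings (hypothetical values).
query_vec = [1.0, 2.0, 3.0]
doc_vec = [2.0, 4.0, 6.0]     # same direction as query_vec
other_vec = [3.0, 0.0, -1.0]  # orthogonal to query_vec

print(cosine_similarity(query_vec, doc_vec))
print(cosine_similarity(query_vec, other_vec))
```

With the model loaded as in the snippet above, `cosine_similarity(model.encode(text_a), model.encode(text_b))` would score two texts, assuming `encode` returns a one-dimensional vector for a single string (`text_a` and `text_b` are placeholder names).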
+### Evaluation
+Evaluate it with the BEIR framework:
+```python
+from beir.retrieval import models
+from beir.datasets.data_loader import GenericDataLoader
+from beir.retrieval.evaluation import EvaluateRetrieval
+from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES
+# Download the dataset yourself first, using the original BEIR repo
+data_path = './datasets/arguana'
+corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
+model = DRES(models.SentenceBERT("gowitheflow/LASER-cubed-bert-base-unsup"), batch_size=512)
+retriever = EvaluateRetrieval(model, score_function="cos_sim")
+results = retriever.retrieve(corpus, queries)
+ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
+```
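`retriever.evaluate` reports nDCG, MAP, recall, and precision at each cutoff in `retriever.k_values`. As a rough illustration of what the nDCG figure measures, here is a minimal sketch over the dictionary shapes BEIR uses (`qrels` and `results` keyed by query id, then by document id). It is not BEIR's implementation; the function name `ndcg_at_k`, the toy data, and the choice of linear gain with a log2 rank discount are assumptions.

```python
import math
from typing import Dict


def ndcg_at_k(qrels: Dict[str, Dict[str, int]],
              results: Dict[str, Dict[str, float]],
              k: int) -> float:
    """Mean nDCG@k over the queries in qrels (linear gain, log2 discount)."""
    per_query = []
    for qid, rels in qrels.items():
        # Rank the retrieved documents by descending score and keep the top k.
        ranked = sorted(results.get(qid, {}).items(),
                        key=lambda kv: kv[1], reverse=True)[:k]
        dcg = sum(rels.get(doc_id, 0) / math.log2(rank + 2)
                  for rank, (doc_id, _) in enumerate(ranked))
        # Ideal ordering: all judged documents sorted by relevance.
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(rel / math.log2(rank + 2) for rank, rel in enumerate(ideal))
        per_query.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy relevance judgments and retrieval scores (hypothetical values).
toy_qrels = {"q1": {"d1": 1}, "q2": {"d3": 1}}
toy_results = {"q1": {"d1": 0.9, "d2": 0.4},
               "q2": {"d2": 0.8, "d3": 0.7}}

print(ndcg_at_k(toy_qrels, toy_results, k=10))
```

In the toy data, `q1` ranks its relevant document first and `q2` ranks it second, so the mean falls between the two per-query scores.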
+### Downstream Use
+Information Retrieval
+### Out-of-Scope Use
+The model is not intended for further fine-tuning on other tasks (such as classification), as it is trained for representation tasks based on similarity matching.
+## Training Details
+Max sequence length 256, batch size 256, learning rate 3e-05, 1 epoch, 10% warmup, trained on 1 A100 GPU.
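To make these hyperparameters concrete, the sketch below works out the step counts they imply and one possible learning-rate curve. Two things are assumptions, not statements from the model card: that wiki1M contains 1,000,000 training sentences, and that the rate decays linearly to zero after warmup (the card states only the 10% warmup). The names `total_steps`, `warmup_steps`, and `lr_at` are hypothetical.

```python
import math

# Values stated in the model card.
BATCH_SIZE = 256
BASE_LR = 3e-05
EPOCHS = 1

# Assumption: wiki1M holds 1,000,000 sentences (not stated in the card).
NUM_EXAMPLES = 1_000_000

total_steps = math.ceil(NUM_EXAMPLES / BATCH_SIZE) * EPOCHS
warmup_steps = total_steps // 10  # 10% warmup, rounded down


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then linear decay.

    The decay shape is an assumption; only the warmup fraction is documented.
    """
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    return BASE_LR * (total_steps - step) / (total_steps - warmup_steps)


print(total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions, one epoch is 3907 optimizer steps, of which the first 390 are warmup.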
+### Training Data
+wiki1M
+### Training Procedure
+Please refer to the paper.
+## Evaluation
+### Results
+**BibTeX:**
+@inproceedings{xiao2023length,
+  title={Length is a Curse and a Blessing for Document-level Semantics},
+  author={Xiao, Chenghao and Li, Yizhi and Hudson, G and Lin, Chenghua and Al Moubayed, Noura},
+  booktitle={Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing},
+  pages={1385--1396},
+  year={2023}
+}