GIST-all-MiniLM-L6-v2-GGUF
Quantized GGUF model files for GIST-all-MiniLM-L6-v2 from avsolatorio
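These GGUF files can be loaded with llama.cpp-compatible tooling. Below is a minimal sketch using llama-cpp-python; the file name is an assumption, so substitute the quantization variant you actually downloaded.
from llama_cpp import Llama

# Assumed file name -- replace with the GGUF file you downloaded.
llm = Llama(model_path="GIST-all-MiniLM-L6-v2.Q8_0.gguf", embedding=True)

# llm.embed returns the embedding vector for the given text.
vec = llm.embed("Query text to encode")
print(len(vec))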
Original Model Card:
GIST Embedding v0 - all-MiniLM-L6-v2
GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
The model is fine-tuned on top of sentence-transformers/all-MiniLM-L6-v2 using the MEDI dataset augmented with mined triplets from the MTEB Classification training dataset (excluding data from the Amazon Polarity Classification task).
The model does not require any instruction for generating embeddings. This means that queries for retrieval tasks can be directly encoded without crafting instructions.
Technical paper: GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning
Data
The dataset used is a compilation of the MEDI and MTEB Classification training datasets. Third-party datasets may be subject to additional terms and conditions under their associated licenses. A HuggingFace Dataset version of the compiled dataset, together with the specific revision used to train the model, is available:
- Dataset: avsolatorio/medi-data-mteb_avs_triplets
- Revision: 238a0499b6e6b690cc64ea56fde8461daa8341bb
The dataset contains a task_type key, which can be used to select only the MTEB classification tasks (prefixed with mteb_).
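For example, the MTEB classification subset can be selected with the datasets library; a minimal sketch, assuming the split is named train:
from datasets import load_dataset

# Pin the revision used to train the model.
dataset = load_dataset(
    "avsolatorio/medi-data-mteb_avs_triplets",
    revision="238a0499b6e6b690cc64ea56fde8461daa8341bb",
    split="train",  # assumed split name
)

# Keep only the MTEB classification tasks (task_type prefixed with mteb_).
mteb_only = dataset.filter(lambda example: example["task_type"].startswith("mteb_"))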
The MEDI Dataset is published in the following paper: One Embedder, Any Task: Instruction-Finetuned Text Embeddings.
The MTEB benchmark results of the GIST embedding model, compared with the base model, suggest that the fine-tuning dataset has perturbed the model considerably, improving performance significantly on certain tasks while degrading it on others.
The retrieval performance on the TRECCOVID task is of note. The fine-tuning dataset contains little knowledge about COVID-19, which could explain the observed performance degradation. We found some evidence, detailed in the paper, that the thematic coverage of the fine-tuning data can affect downstream performance.
Usage
The model can be easily loaded using the Sentence Transformers library.
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
revision = None # Replace with the specific revision to ensure reproducibility if the model is updated.
model = SentenceTransformer("avsolatorio/GIST-all-MiniLM-L6-v2", revision=revision)
texts = [
"Illustration of the REaLTabFormer model. The left block shows the non-relational tabular data model using GPT-2 with a causal LM head. In contrast, the right block shows how a relational dataset's child table is modeled using a sequence-to-sequence (Seq2Seq) model. The Seq2Seq model uses the observations in the parent table to condition the generation of the observations in the child table. The trained GPT-2 model on the parent table, with weights frozen, is also used as the encoder in the Seq2Seq model.",
"Predicting human mobility holds significant practical value, with applications ranging from enhancing disaster risk planning to simulating epidemic spread. In this paper, we present the GeoFormer, a decoder-only transformer model adapted from the GPT architecture to forecast human mobility.",
"As the economies of Southeast Asia continue adopting digital technologies, policy makers increasingly ask how to prepare the workforce for emerging labor demands. However, little is known about the skills that workers need to adapt to these changes"
]
# Compute embeddings
embeddings = model.encode(texts, convert_to_tensor=True)
# Compute cosine-similarity for each pair of sentences
scores = F.cosine_similarity(embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1)
print(scores.cpu().numpy())
Training Parameters
Below are the training parameters used to fine-tune the model:
- Epochs = 40
- Warmup ratio = 0.1
- Learning rate = 5e-6
- Batch size = 16
- Checkpoint step = 102000
- Contrastive loss temperature = 0.01 (see the sketch below)
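For context, the sketch below shows a generic temperature-scaled contrastive (InfoNCE-style) objective in PyTorch, illustrating where the temperature above enters. It is not the exact GISTEmbed loss, which additionally uses a guide model to select in-sample negatives; see the paper for details.
import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, pos_emb, temperature=0.01):
    # Normalize so dot products equal cosine similarities.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    # Pairwise similarity matrix; matching pairs lie on the diagonal.
    logits = q @ p.T / temperature
    labels = torch.arange(q.size(0), device=q.device)
    # Cross-entropy pulls each query toward its positive while pushing it
    # away from the other in-batch examples, which act as negatives.
    return F.cross_entropy(logits, labels)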
Evaluation
The model was evaluated using the MTEB Evaluation suite.
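A single task can be reproduced with the mteb package; a minimal sketch, where the task choice and output folder are illustrative:
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("avsolatorio/GIST-all-MiniLM-L6-v2")

# Illustrative task choice; the full suite covers many more tasks.
evaluation = MTEB(tasks=["ArguAna"])
results = evaluation.run(model, output_folder="results")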
Citation
Please cite our work if you use GISTEmbed or the datasets we published in your projects or research. 🤗
@article{solatorio2024gistembed,
title={GISTEmbed: Guided In-sample Selection of Training Negatives for Text Embedding Fine-tuning},
author={Aivin V. Solatorio},
journal={arXiv preprint arXiv:2402.16829},
year={2024},
url={https://arxiv.org/abs/2402.16829},
eprint={2402.16829},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
Acknowledgements
This work is supported by the "KCP IV - Exploring Data Use in the Development Economics Literature using Large Language Models (AI and LLMs)" project funded by the Knowledge for Change Program (KCP) of the World Bank - RA-P503405-RESE-TF0C3444.
The findings, interpretations, and conclusions expressed in this material are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.
Evaluation results
All results are self-reported on MTEB test sets:
- AmazonCounterfactualClassification (en): accuracy 72.896, ap 35.448, f1 66.830
- AmazonPolarityClassification: accuracy 87.195, ap 83.096, f1 87.138
- AmazonReviewsClassification (en): accuracy 42.556, f1 42.236
- ArguAna: map_at_1 26.885, map_at_10 42.364