SentenceTransformer based on distilbert/distilroberta-base

This is a sentence-transformers model finetuned from distilbert/distilroberta-base on the sentence-transformers/all-nli dataset. It maps sentences & paragraphs to a 256-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (reduced_dim): Dense({'in_features': 768, 'out_features': 256, 'bias': True, 'activation_function': 'torch.nn.modules.activation.Tanh'})
)
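The three modules listed above can be read as a pipeline: token vectors from the transformer are mean-pooled into one 768-dimensional vector, which the Dense layer then projects to 256 dimensions through a Tanh activation. The following is a minimal sketch of the last two stages in plain Python, using toy sizes (4 inputs, 2 outputs) and made-up weights; the helper names `mean_pool` and `dense_tanh` are hypothetical and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (mean-token pooling)."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def dense_tanh(vector, weight, bias):
    """Apply y = tanh(W x + b); `weight` holds one row per output feature."""
    return [
        math.tanh(sum(w * x for w, x in zip(row, vector)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy sizes for illustration only: 4-dim token vectors projected to 2 dims.
# The model described in this card uses 768 -> 256.
token_embeddings = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 2.0, 1.0, 0.0],
    [9.0, 9.0, 9.0, 9.0],  # padding position, masked out below
]
attention_mask = [1, 1, 0]
weight = [[0.1, 0.0, -0.1, 0.0], [0.0, 0.2, 0.0, -0.2]]  # made-up weights
bias = [0.0, 0.1]

pooled = mean_pool(token_embeddings, attention_mask)
sentence_embedding = dense_tanh(pooled, weight, bias)
print(pooled)
print(sentence_embedding)
```

Because of the Tanh activation, every component of the final embedding lies strictly between -1 and 1.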

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("tomaarsen/distilroberta-base-nli-matryoshka-reduced")
# Run inference
sentences = [
    'A boy is vacuuming.',
    'A little boy is vacuuming the floor.',
    'A woman is applying eye shadow.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 256]

# Get the pairwise similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
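Since the model was trained with MatryoshkaLoss on dimensions 256, 128, 64, 32 and 16, the leading components of each embedding are intended to remain useful on their own. The sketch below illustrates the idea with plain Python lists: it keeps the first `dim` components and recomputes cosine similarity. The 8-dimensional vectors are made up and stand in for real 256-dimensional embeddings; `truncate` and `cosine_similarity` are hypothetical helper names.

```python
import math

def truncate(embedding, dim):
    """Keep only the first `dim` components of an embedding."""
    return embedding[:dim]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors for illustration; real embeddings from this model have 256 dims.
emb_a = [0.9, 0.1, -0.3, 0.4, 0.05, -0.02, 0.01, 0.03]
emb_b = [0.8, 0.2, -0.2, 0.5, -0.04, 0.03, 0.02, -0.01]

for dim in (8, 4, 2):
    score = cosine_similarity(truncate(emb_a, dim), truncate(emb_b, dim))
    print(f"dim={dim}: cosine similarity = {score:.4f}")
```

Cosine similarity renormalizes implicitly, so truncated vectors need no separate normalization step here. Recent releases of the library also accept a `truncate_dim` argument when constructing a `SentenceTransformer`; check the version you have installed before relying on it.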

Evaluation

Metrics

Semantic Similarity (sts-dev-256)

Metric Value
pearson_cosine 0.833
spearman_cosine 0.845
pearson_manhattan 0.8284
spearman_manhattan 0.8314
pearson_euclidean 0.8291
spearman_euclidean 0.8319
pearson_dot 0.7274
spearman_dot 0.7358
pearson_max 0.833
spearman_max 0.845

Semantic Similarity (sts-dev-128)

Metric Value
pearson_cosine 0.8266
spearman_cosine 0.8416
pearson_manhattan 0.825
spearman_manhattan 0.8277
pearson_euclidean 0.8256
spearman_euclidean 0.8285
pearson_dot 0.712
spearman_dot 0.7163
pearson_max 0.8266
spearman_max 0.8416

Semantic Similarity (sts-dev-64)

Metric Value
pearson_cosine 0.8171
spearman_cosine 0.8356
pearson_manhattan 0.8176
spearman_manhattan 0.8213
pearson_euclidean 0.8175
spearman_euclidean 0.8216
pearson_dot 0.6852
spearman_dot 0.6861
pearson_max 0.8176
spearman_max 0.8356

Semantic Similarity (sts-dev-32)

Metric Value
pearson_cosine 0.7964
spearman_cosine 0.8244
pearson_manhattan 0.7983
spearman_manhattan 0.8049
pearson_euclidean 0.8003
spearman_euclidean 0.807
pearson_dot 0.6312
spearman_dot 0.6277
pearson_max 0.8003
spearman_max 0.8244

Semantic Similarity (sts-dev-16)

Metric Value
pearson_cosine 0.7401
spearman_cosine 0.7872
pearson_manhattan 0.761
spearman_manhattan 0.7761
pearson_euclidean 0.7645
spearman_euclidean 0.7794
pearson_dot 0.5202
spearman_dot 0.5115
pearson_max 0.7645
spearman_max 0.7872

Semantic Similarity (sts-test-256)

Metric Value
pearson_cosine 0.8124
spearman_cosine 0.8211
pearson_manhattan 0.7835
spearman_manhattan 0.7822
pearson_euclidean 0.7852
spearman_euclidean 0.784
pearson_dot 0.5917
spearman_dot 0.5785
pearson_max 0.8124
spearman_max 0.8211

Semantic Similarity (sts-test-128)

Metric Value
pearson_cosine 0.8079
spearman_cosine 0.819
pearson_manhattan 0.7795
spearman_manhattan 0.7786
pearson_euclidean 0.7813
spearman_euclidean 0.7813
pearson_dot 0.5714
spearman_dot 0.5602
pearson_max 0.8079
spearman_max 0.819

Semantic Similarity (sts-test-64)

Metric Value
pearson_cosine 0.7988
spearman_cosine 0.8129
pearson_manhattan 0.7728
spearman_manhattan 0.7728
pearson_euclidean 0.7735
spearman_euclidean 0.7751
pearson_dot 0.5397
spearman_dot 0.5279
pearson_max 0.7988
spearman_max 0.8129

Semantic Similarity (sts-test-32)

Metric Value
pearson_cosine 0.772
spearman_cosine 0.7936
pearson_manhattan 0.7561
spearman_manhattan 0.7597
pearson_euclidean 0.7581
spearman_euclidean 0.7628
pearson_dot 0.489
spearman_dot 0.4779
pearson_max 0.772
spearman_max 0.7936

Semantic Similarity (sts-test-16)

Metric Value
pearson_cosine 0.7138
spearman_cosine 0.7486
pearson_manhattan 0.7254
spearman_manhattan 0.7339
pearson_euclidean 0.7274
spearman_euclidean 0.7382
pearson_dot 0.3856
spearman_dot 0.3749
pearson_max 0.7274
spearman_max 0.7486
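As a rough illustration of what a metric such as spearman_cosine measures: each sentence pair gets a cosine similarity from its two embeddings, and the Spearman rank correlation is then taken between those similarities and the gold scores. The sketch below implements this in plain Python on made-up 2-dimensional vectors and made-up gold scores; it is not the library's evaluator, and all function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = avg
        start = end + 1
    return ranks

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up embedding pairs and gold similarity scores, for illustration only.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.2]),
]
gold_scores = [1.0, 0.6, 0.2, 0.0]

predicted = [cosine_similarity(a, b) for a, b in pairs]
print(spearman(predicted, gold_scores))
```

The Manhattan, Euclidean and dot variants in the tables differ only in the score computed per pair; the correlation step is the same.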

Training Details

Training Dataset

sentence-transformers/all-nli

  • Dataset: sentence-transformers/all-nli at 65dd388
  • Size: 557,850 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:
    • anchor (string): min 7 tokens, mean 10.38 tokens, max 45 tokens
    • positive (string): min 6 tokens, mean 12.8 tokens, max 39 tokens
    • negative (string): min 6 tokens, mean 13.4 tokens, max 50 tokens
  • Samples (anchor | positive | negative):
    • A person on a horse jumps over a broken down airplane. | A person is outdoors, on a horse. | A person is at a diner, ordering an omelette.
    • Children smiling and waving at camera | There are children present | The kids are frowning
    • A boy is jumping on skateboard in the middle of a red bridge. | The boy does a skateboarding trick. | The boy skates down the sidewalk.
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            256,
            128,
            64,
            32,
            16
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
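To make the parameters above concrete, here is a simplified sketch of how a Matryoshka-style loss can wrap an in-batch ranking loss: embeddings are truncated to each listed dimension, the inner loss is computed per dimension, and the results are combined with the given weights. Several things are assumptions for illustration: the inner loss uses cosine similarity with a scale of 20, the explicit negative column is omitted (only other positives in the batch act as negatives), and the toy vectors are made up. This is not the library's implementation.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy in which positives[i] is the correct match for anchors[i]
    and every other positive in the batch serves as a negative.
    The scale value is an assumption made for this sketch."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the inner loss over truncated copies of the embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        truncated_anchors = [vec[:dim] for vec in anchors]
        truncated_positives = [vec[:dim] for vec in positives]
        total += weight * multiple_negatives_ranking_loss(
            truncated_anchors, truncated_positives
        )
    return total

# Made-up 4-dimensional embeddings; the card's setup uses dims 256 down to 16.
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

loss = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
print(loss)
```

With equal weights, as in the configuration above, each truncation level contributes equally, which pushes the leading dimensions to carry most of the useful signal.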
    

Evaluation Dataset

sentence-transformers/stsb

  • Dataset: sentence-transformers/stsb at ab7a5ac
  • Size: 1,500 evaluation samples
  • Columns: sentence1, sentence2, and score
  • Approximate statistics based on the first 1000 samples:
    • sentence1 (string): min 5 tokens, mean 15.0 tokens, max 44 tokens
    • sentence2 (string): min 6 tokens, mean 14.99 tokens, max 61 tokens
    • score (float): min 0.0, mean 0.47, max 1.0
  • Samples (sentence1 | sentence2 | score):
    • A man with a hard hat is dancing. | A man wearing a hard hat is dancing. | 1.0
    • A young child is riding a horse. | A child is riding a horse. | 0.95
    • A man is feeding a mouse to a snake. | The man is feeding a mouse to the snake. | 1.0
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            256,
            128,
            64,
            32,
            16
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
  • batch_sampler: no_duplicates
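These settings imply a step count and a learning-rate curve that can be worked out by hand. The sketch below does the arithmetic from the values in this card (557,850 training samples, batch size 128, one epoch, warmup ratio 0.1) together with the learning rate of 5e-05 and the linear scheduler listed under All Hyperparameters. It assumes a single device, no dropped last batch, and a warmup step count rounded up; the no_duplicates sampler could in principle change the number of batches, so treat the result as an estimate. The function name `learning_rate` is hypothetical.

```python
import math

num_samples = 557_850
batch_size = 128
num_epochs = 1
warmup_ratio = 0.1
peak_lr = 5e-05

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # rounding up is an assumption

def learning_rate(step):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(f"step {step}: lr = {learning_rate(step):.2e}")
```

The computed total of 4359 steps agrees with the final step number in the training logs.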

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: False
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: None
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss loss sts-dev-128_spearman_cosine sts-dev-16_spearman_cosine sts-dev-256_spearman_cosine sts-dev-32_spearman_cosine sts-dev-64_spearman_cosine sts-test-128_spearman_cosine sts-test-16_spearman_cosine sts-test-256_spearman_cosine sts-test-32_spearman_cosine sts-test-64_spearman_cosine
0.0229 100 21.0363 14.2448 0.7856 0.7417 0.7873 0.7751 0.7846 - - - - -
0.0459 200 11.1093 13.4736 0.7877 0.7298 0.7861 0.7687 0.7798 - - - - -
0.0688 300 10.1847 13.7191 0.7877 0.7284 0.7898 0.7617 0.7755 - - - - -
0.0918 400 9.356 13.2955 0.7906 0.7385 0.7914 0.7715 0.7799 - - - - -
0.1147 500 8.9318 12.8099 0.7889 0.7346 0.7910 0.7690 0.7801 - - - - -
0.1376 600 8.5293 13.7384 0.7814 0.7362 0.7866 0.7656 0.7736 - - - - -
0.1606 700 8.7589 13.4466 0.7899 0.7467 0.7945 0.7770 0.7847 - - - - -
0.1835 800 7.7941 13.6734 0.7960 0.7526 0.7986 0.7800 0.7894 - - - - -
0.2065 900 7.9183 12.9082 0.7885 0.7470 0.7966 0.7705 0.7803 - - - - -
0.2294 1000 7.3669 13.2827 0.7751 0.7181 0.7822 0.7557 0.7675 - - - - -
0.2524 1100 7.6205 13.0227 0.7875 0.7373 0.7914 0.7730 0.7828 - - - - -
0.2753 1200 7.4308 13.4980 0.7844 0.7373 0.7890 0.7709 0.7755 - - - - -
0.2982 1300 7.3625 12.8380 0.7984 0.7520 0.8032 0.7824 0.7915 - - - - -
0.3212 1400 6.9421 12.7016 0.7912 0.7358 0.7960 0.7749 0.7850 - - - - -
0.3441 1500 7.0635 13.2198 0.8018 0.7578 0.8070 0.7861 0.7961 - - - - -
0.3671 1600 6.6682 13.3225 0.7906 0.7522 0.7944 0.7763 0.7849 - - - - -
0.3900 1700 6.42 12.7381 0.7984 0.7449 0.8021 0.7806 0.7911 - - - - -
0.4129 1800 6.659 13.0247 0.7947 0.7461 0.8002 0.7808 0.7876 - - - - -
0.4359 1900 6.1664 12.6814 0.7893 0.7312 0.7959 0.7700 0.7807 - - - - -
0.4588 2000 6.392 13.0238 0.7935 0.7354 0.7987 0.7758 0.7860 - - - - -
0.4818 2100 6.177 12.8833 0.7891 0.7428 0.7924 0.7723 0.7801 - - - - -
0.5047 2200 6.0411 12.5269 0.7836 0.7400 0.7875 0.7664 0.7765 - - - - -
0.5276 2300 6.1506 13.4349 0.7741 0.7350 0.7803 0.7556 0.7634 - - - - -
0.5506 2400 6.109 12.6996 0.7808 0.7326 0.7860 0.7663 0.7735 - - - - -
0.5735 2500 6.2849 13.2831 0.7874 0.7365 0.7932 0.7727 0.7794 - - - - -
0.5965 2600 6.0658 12.9425 0.7988 0.7481 0.8042 0.7818 0.7889 - - - - -
0.6194 2700 6.0646 13.0144 0.7965 0.7509 0.8010 0.7800 0.7875 - - - - -
0.6423 2800 6.0795 12.7602 0.7912 0.7472 0.7937 0.7778 0.7818 - - - - -
0.6653 2900 6.2407 13.2381 0.7829 0.7381 0.7873 0.7664 0.7765 - - - - -
0.6882 3000 6.1872 12.9064 0.7942 0.7516 0.7965 0.7793 0.7857 - - - - -
0.7112 3100 5.8987 12.9323 0.8065 0.7585 0.8087 0.7909 0.7989 - - - - -
0.7341 3200 5.996 13.1017 0.7971 0.7566 0.8005 0.7811 0.7889 - - - - -
0.7571 3300 5.3748 12.7601 0.8398 0.7881 0.8441 0.8232 0.8337 - - - - -
0.7800 3400 4.0798 12.7221 0.8400 0.7908 0.8440 0.8255 0.8342 - - - - -
0.8029 3500 3.6024 12.5445 0.8408 0.7892 0.8447 0.8247 0.8347 - - - - -
0.8259 3600 3.4619 12.6025 0.8405 0.7883 0.8442 0.8255 0.8347 - - - - -
0.8488 3700 3.2288 12.6636 0.8388 0.7872 0.8433 0.8226 0.8330 - - - - -
0.8718 3800 3.0543 12.6475 0.8386 0.7834 0.8427 0.8229 0.8330 - - - - -
0.8947 3900 3.0368 12.5390 0.8407 0.7845 0.8444 0.8227 0.8346 - - - - -
0.9176 4000 2.9591 12.5709 0.8419 0.7864 0.8456 0.8245 0.8359 - - - - -
0.9406 4100 2.944 12.6029 0.8415 0.7868 0.8452 0.8245 0.8359 - - - - -
0.9635 4200 2.9032 12.5514 0.8423 0.7888 0.8455 0.8254 0.8363 - - - - -
0.9865 4300 2.838 12.6054 0.8416 0.7872 0.8450 0.8244 0.8356 - - - - -
1.0 4359 - - - - - - - 0.8190 0.7486 0.8211 0.7936 0.8129
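One way to read the final row of the log is to ask how much of the 256-dimensional test score each smaller dimension retains. The short sketch below computes that ratio from the sts-test spearman_cosine values in the last row; the variable names are hypothetical and the calculation is only a convenience for interpreting the table.

```python
# sts-test spearman_cosine values from the final row of the training log.
test_spearman_cosine = {256: 0.8211, 128: 0.8190, 64: 0.8129, 32: 0.7936, 16: 0.7486}

full_score = test_spearman_cosine[256]
retention = {dim: score / full_score for dim, score in test_spearman_cosine.items()}

for dim in sorted(retention, reverse=True):
    print(f"{dim:>3} dims: {retention[dim]:.1%} of the 256-dim score")
```

On these numbers, the 16-dimensional embeddings keep roughly nine tenths of the full-size score while using one sixteenth of the storage.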

Environmental Impact

Carbon emissions were measured using CodeCarbon.

  • Energy Consumed: 0.244 kWh
  • Carbon Emitted: 0.095 kg of CO2
  • Hours Used: 0.923 hours
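The three figures above can be combined into two derived quantities that are easier to compare across runs: the carbon intensity of the electricity used and the average power draw over the run. The sketch below performs that arithmetic; it assumes the reported energy covers the whole measured period, and the variable names are hypothetical.

```python
energy_kwh = 0.244   # Energy Consumed
carbon_kg = 0.095    # Carbon Emitted
hours = 0.923        # Hours Used

carbon_intensity = carbon_kg / energy_kwh   # kg of CO2 per kWh
average_power_kw = energy_kwh / hours       # average draw in kW

print(f"carbon intensity: {carbon_intensity:.3f} kg CO2 per kWh")
print(f"average power:    {average_power_kw * 1000:.0f} W")
```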

Training Hardware

  • On Cloud: No
  • GPU Model: 1 x NVIDIA GeForce RTX 3090
  • CPU Model: 13th Gen Intel(R) Core(TM) i7-13700K
  • RAM Size: 31.78 GB

Framework Versions

  • Python: 3.11.6
  • Sentence Transformers: 3.0.0.dev0
  • Transformers: 4.41.0.dev0
  • PyTorch: 2.3.0+cu121
  • Accelerate: 0.26.1
  • Datasets: 2.18.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 82.1M params (Safetensors, tensor type F32)