SentenceTransformer based on Alibaba-NLP/gte-modernbert-base

This is a sentence-transformers model finetuned from Alibaba-NLP/gte-modernbert-base. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-modernbert-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: ModernBertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
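
The Pooling module above has pooling_mode_cls_token set to True, so the sentence embedding is the output vector of the first token rather than an average over all tokens. A minimal sketch of the difference, using plain Python lists in place of tensors (the function names here are illustrative and not part of the library):

```python
def cls_pool(token_vectors):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Mean pooling over non-padding tokens, shown only for contrast."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    return [sum(column) / len(kept) for column in zip(*kept)]


# Toy sequence: two real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]

cls_vec = cls_pool(tokens)
mean_vec = mean_pool(tokens, mask)
print(cls_vec, mean_vec)
```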

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    "How is 'associated undertaking' defined, and what criteria determine the significant influence of one undertaking over another in terms of voting rights?",
    "▼B\n\n(6)\n\n‘purchase price’ means the price payable and any incidental expenses minus any incidental reductions in the cost of acquisition;\n\n(7)\n\n‘production cost’ means the purchase price of raw materials, consumables and other costs directly attributable to the item in question. Member States shall permit or require the inclusion of a reasonable proportion of fixed or variable overhead costs indirectly attributable to the item in question, to the extent that they relate to the period of production. Distribution costs shall not be included;\n\n(8)\n\n‘value adjustment’ means the adjustments intended to take account of changes in the values of individual assets established at the balance sheet date, whether the change is final or not;\n\n(9)\n\n‘parent undertaking’ means an undertaking which controls one or more subsidiary undertakings;\n\n(10)\n\n‘subsidiary undertaking’ means an undertaking controlled by a parent undertaking, including any subsidiary undertaking of an ultimate parent undertaking;\n\n(11)\n\n‘group’ means a parent undertaking and all its subsidiary undertakings;\n\n(12)\n\n‘affiliated undertakings’ means any two or more undertakings within a group;\n\n(13)\n\n‘associated undertaking’ means an undertaking in which another undertaking has a participating interest, and over whose operating and financial policies that other undertaking exercises significant influence. An undertaking is presumed to exercise a significant influence over another undertaking where it has 20 % or more of the shareholders' or members' voting rights in that other undertaking;\n\n(14)\n\n‘investment undertakings’ means:\n\n(a)\n\nundertakings the sole object of which is to invest their funds in various securities, real property and other assets, with the sole aim of spreading investment risks and giving their shareholders the benefit of the results of the management of their assets,\n\n(b)\n\nundertakings associated with investment undertakings with fixed capital, if the sole object of those associated undertakings is to acquire fully paid shares issued by those investment undertakings without prejudice to point (h) of Article 22(1) of Directive 2012/30/EU;\n\n(15)",
    'and non-European non-financial corporations not subject to the disclosure obligations laid down in Directive 2013/34/EU. That information may be disclosed only once, based on counterparties’ turnover alignment for the general-purpose lending loans, as in the case of the GAR. The first disclosure reference date of this template is as of 31 December 2024. Institutions are not required to disclose this information before 1 January 2025. ---|---|---',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
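
Because the model was trained with MatryoshkaLoss over the dimensions 768, 512, 256, 128 and 64 (see Training Details), embeddings can in principle be shortened to one of those sizes before comparison. The sketch below shows the idea with toy 4-dimensional vectors and plain Python: keep the leading dimensions, re-normalize, and rank by cosine similarity. The helper names and the toy vectors are assumptions made for illustration, not output from this model.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def rank_by_cosine(query, docs, dim):
    """Return (score, doc_index) pairs sorted from most to least similar."""
    q = truncate_and_normalize(query, dim)
    scored = []
    for idx, doc in enumerate(docs):
        d = truncate_and_normalize(doc, dim)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    return sorted(scored, reverse=True)


# Toy vectors standing in for model.encode(...) output.
query = [1.0, 0.0, 0.2, 0.1]
docs = [
    [0.0, 1.0, 0.1, 0.0],
    [0.9, 0.1, 0.3, 0.0],
    [0.5, 0.5, 0.0, 0.0],
]

full_order = [idx for _, idx in rank_by_cosine(query, docs, dim=4)]
short_order = [idx for _, idx in rank_by_cosine(query, docs, dim=2)]
print(full_order, short_order)
```

With real embeddings, the same effect is usually obtained by passing truncate_dim when constructing SentenceTransformer; check the library version you have installed before relying on that argument.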

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.691
cosine_accuracy@3 0.9109
cosine_accuracy@5 0.9461
cosine_accuracy@10 0.9743
cosine_precision@1 0.691
cosine_precision@3 0.3036
cosine_precision@5 0.1892
cosine_precision@10 0.0974
cosine_recall@1 0.691
cosine_recall@3 0.9109
cosine_recall@5 0.9461
cosine_recall@10 0.9743
cosine_ndcg@10 0.8472
cosine_mrr@10 0.8048
cosine_map@100 0.8061
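
In the table above, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k (for example 0.9109 / 3 ≈ 0.3036). That pattern is what one would expect if every query has exactly one relevant passage; the card does not state this, so treat it as an inference. Under that assumption, the metrics can be sketched from the rank at which the relevant passage was retrieved (the function name and the toy ranks are illustrative):

```python
import math


def retrieval_metrics(ranks, k):
    """Compute @k metrics when each query has exactly one relevant passage.

    ranks: 1-based rank of the relevant passage per query, or None if it
    was not retrieved at all.
    """
    n = len(ranks)
    hit_ranks = [r for r in ranks if r is not None and r <= k]
    accuracy = len(hit_ranks) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # one relevant item out of k retrieved
        "recall": accuracy,         # the single relevant item is found or not
        "mrr": sum(1.0 / r for r in hit_ranks) / n,
        # With one relevant item the ideal DCG is 1, so NDCG equals DCG.
        "ndcg": sum(1.0 / math.log2(r + 1) for r in hit_ranks) / n,
    }


# Five toy queries: relevant passage at ranks 1, 2, 1, 4, and one miss.
ranks = [1, 2, 1, 4, None]
scores = retrieval_metrics(ranks, k=3)
print(scores)
```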

Training Details

Training Dataset

Unnamed Dataset

  • Size: 46,338 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (string): min 13 tokens, mean 34.18 tokens, max 251 tokens
    • sentence_1 (string): min 7 tokens, mean 231.33 tokens, max 2146 tokens
  • Samples:
    Sample 1
    sentence_0: How is 'energy efficiency' defined in the context of Directive (EU) 2018/2001?
    sentence_1: of Directive (EU) 2018/2001; --- --- (8) ‘energy efficiency’ means the ratio of output of performance, service, goods or energy to input of energy; --- --- (9) ‘energy savings’ means an amount of saved energy determined by measuring or estimating consumption, or both,, before and after the implementation of an energy efficiency improvement measure, whilst ensuring normalisation for external conditions that affect energy consumption; --- --- (10) ‘energy efficiency improvement’ means an increase in energy efficiency as a result of any technological, behavioural or economic changes; --- --- (11) ‘energy service’ means the physical benefit, utility or good derived from a combination of energy with energy-efficient technology or with action,

    Sample 2
    sentence_0: What are the sources of information that the external experts will use to create the list of conflict-affected and high-risk areas?
    sentence_1: 2.

    The Commission shall call upon external expertise that will provide an indicative, non-exhaustive, regularly updated list of conflict-affected and high-risk areas. That list shall be based on the external experts' analysis of the handbook referred to in paragraph 1 and existing information from, inter alia, academics and supply chain due diligence schemes. Union importers sourcing from areas which are not mentioned on that list shall also maintain their responsibility to comply with the due diligence obligations under this Regulation.

    Article 15

    Committee procedure

    1.

    The Commission shall be assisted by a committee. That committee shall be a committee within the meaning of Regulation (EU) No 182/2011.

    2.

    Sample 3
    sentence_0: What is the maximum time frame for completing the undertaking according to the technical specifications set out in Annexes II and III after the Directive enters into force?
    sentence_1: is undertaken according to the technical specifications set out in Annexes II and III and that it is completed at the latest four years after the date of entry into force of this Directive.

    2. The analyses and reviews mentioned under paragraph 1 shall be reviewed, and if necessary updated at the latest 13 years after the date of entry into force of this Directive and every six years thereafter.

    Article 6

    Register of protected areas
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
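
To make the parameters above concrete, here is a rough, plain-Python sketch of how a Matryoshka-style loss can wrap an in-batch-negatives ranking loss: the same loss is computed on truncated copies of the embeddings, one per entry in matryoshka_dims, and the results are combined using matryoshka_weights. This is an illustration of the idea and not the library's implementation. The similarity scale of 20 is an assumption (the card does not list it), and the toy vectors use 4 and 2 dimensions instead of 768 down to 64.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(queries, passages, scale=20.0):
    """Cross-entropy where passage i is the positive for query i and all
    other passages in the batch act as negatives. `scale` is assumed."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * cosine(q, p) for p in passages]
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(queries)


def matryoshka_loss(queries, passages, dims, weights):
    """Weighted sum of the ranking loss over truncated embeddings."""
    return sum(
        w * ranking_loss([q[:d] for q in queries], [p[:d] for p in passages])
        for d, w in zip(dims, weights)
    )


queries = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
passages = [[0.9, 0.1, 0.4, 0.0], [0.1, 0.9, 0.0, 0.6]]

aligned = matryoshka_loss(queries, passages, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(queries, passages[::-1], dims=[4, 2], weights=[1, 1])
print(aligned, swapped)
```

Correctly paired queries and passages give a much lower loss than the swapped pairing, at every truncation size, which is the property the training objective rewards.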
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • num_train_epochs: 4
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 4
  • per_device_eval_batch_size: 4
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
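
With lr_scheduler_type set to linear, warmup_steps at 0 and learning_rate at 5e-05, the learning rate decays in a straight line from its initial value to zero over the run. A hedged sketch of that schedule follows; the total of 46340 steps is derived from the dataset size and batch size under a single-device assumption, and the function is an illustration rather than the trainer's own code.

```python
import math

BASE_LR = 5e-05
TOTAL_STEPS = math.ceil(46_338 / 4) * 4  # steps per epoch times 4 epochs


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=0):
    """Linear warmup (unused here, since warmup_steps is 0) then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)


for step in (0, 23170, 42000, TOTAL_STEPS):
    print(step, linear_lr(step))
```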

Training Logs

Epoch Step Training Loss cosine_ndcg@10
0.0432 500 0.358 -
0.0863 1000 0.1048 -
0.1295 1500 0.0827 -
0.1726 2000 0.067 0.7969
0.2158 2500 0.0491 -
0.2590 3000 0.0831 -
0.3021 3500 0.062 -
0.3453 4000 0.0657 0.8050
0.3884 4500 0.0522 -
0.4316 5000 0.049 -
0.4748 5500 0.0426 -
0.5179 6000 0.0708 0.8215
0.5611 6500 0.0236 -
0.6042 7000 0.024 -
0.6474 7500 0.0256 -
0.6905 8000 0.041 0.8105
0.7337 8500 0.0285 -
0.7769 9000 0.0249 -
0.8200 9500 0.0368 -
0.8632 10000 0.0588 0.8118
0.9063 10500 0.0386 -
0.9495 11000 0.0456 -
0.9927 11500 0.0399 -
1.0 11585 - 0.8184
1.0358 12000 0.0424 0.8239
1.0790 12500 0.0107 -
1.1221 13000 0.0279 -
1.1653 13500 0.0236 -
1.2085 14000 0.024 0.8193
1.2516 14500 0.0143 -
1.2948 15000 0.0118 -
1.3379 15500 0.0078 -
1.3811 16000 0.023 0.8217
1.4243 16500 0.0239 -
1.4674 17000 0.0335 -
1.5106 17500 0.0119 -
1.5537 18000 0.0411 0.8292
1.5969 18500 0.0168 -
1.6401 19000 0.0059 -
1.6832 19500 0.0234 -
1.7264 20000 0.0184 0.8366
1.7695 20500 0.0128 -
1.8127 21000 0.0166 -
1.8558 21500 0.0181 -
1.8990 22000 0.0148 0.8353
1.9422 22500 0.0225 -
1.9853 23000 0.0158 -
2.0 23170 - 0.8360
2.0285 23500 0.0123 -
2.0716 24000 0.0173 0.8329
2.1148 24500 0.0167 -
2.1580 25000 0.0125 -
2.2011 25500 0.013 -
2.2443 26000 0.0079 0.8338
2.2874 26500 0.007 -
2.3306 27000 0.0171 -
2.3738 27500 0.0058 -
2.4169 28000 0.0048 0.8405
2.4601 28500 0.005 -
2.5032 29000 0.0141 -
2.5464 29500 0.0132 -
2.5896 30000 0.006 0.8461
2.6327 30500 0.0095 -
2.6759 31000 0.0061 -
2.7190 31500 0.0107 -
2.7622 32000 0.0157 0.8451
2.8054 32500 0.005 -
2.8485 33000 0.0087 -
2.8917 33500 0.0064 -
2.9348 34000 0.005 0.8449
2.9780 34500 0.0115 -
3.0 34755 - 0.8451
3.0211 35000 0.0079 -
3.0643 35500 0.0045 -
3.1075 36000 0.0029 0.8443
3.1506 36500 0.0161 -
3.1938 37000 0.0144 -
3.2369 37500 0.0076 -
3.2801 38000 0.0157 0.8500
3.3233 38500 0.0039 -
3.3664 39000 0.0045 -
3.4096 39500 0.0033 -
3.4527 40000 0.0064 0.8434
3.4959 40500 0.0054 -
3.5391 41000 0.0061 -
3.5822 41500 0.0051 -
3.6254 42000 0.0019 0.8472
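
The last logged evaluation (step 42000, cosine_ndcg@10 of 0.8472) matches the value reported under Metrics, while the highest logged value is 0.8500 at step 38000. A short sketch of picking the best evaluation from log rows, using the rows from the final logged epoch above:

```python
# (step, cosine_ndcg@10) pairs copied from the rows at epoch 3.0 and later.
eval_log = [
    (34755, 0.8451),
    (36000, 0.8443),
    (38000, 0.8500),
    (40000, 0.8434),
    (42000, 0.8472),
]

best_step, best_ndcg = max(eval_log, key=lambda row: row[1])
last_step, last_ndcg = eval_log[-1]
print(best_step, best_ndcg, last_step, last_ndcg)
```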

Framework Versions

  • Python: 3.10.15
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.6.0+cu126
  • Accelerate: 1.5.2
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 149M params (Safetensors, tensor type F32)