---
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:11480
  - loss:OnlineContrastiveLoss
base_model: thenlper/gte-large
widget:
  - source_sentence: PEÑA JARAMILLO
    sentences:
      - OSCAR ALBERTO ARREDONDO CASTANO
      - JAIME ALBERTO QUINETRO SABOYA
      - NESTOR HENRRY REYES PEÑA
  - source_sentence: ALBERTO ANTONIO ZAPATA ELJACH ANUAR
    sentences:
      - ' SANTIAGOPE MORENO'
      - DIEGOALVAREZ HERNANDEZ
      - GABRIEL ALVARO ZAPATA B
  - source_sentence: PAULA ANDREA VARGAS LOPEZ
    sentences:
      - LUZ MILENE GONZALEZ BRAVO
      - FLAVIO ALBERTO DE JESUS ROLDAN MARTINEZ
      - CAMILIA ANDREA VARGAS LOPEZ
  - source_sentence: RAFAEL ANTONIO MARTINEZ RODRIGUEZ
    sentences:
      - RAFAEL TOMAS MARTINEZ RODRIGUEZ
      - LEONOR DE
      - MARTHA EUGEN GARCIA DE MARTINEZ VILLALBA
  - source_sentence: ADRIANA JOSEFINA GRATEROL DE GUTIERREZ
    sentences:
      - CLAUDIA ROCIO RINCON SANCHEZ
      - JOHSON ENRIQUE CORTES CRTES
      - GUSTAVO LONDONO GUTIERREZ
pipeline_tag: sentence-similarity
library_name: sentence-transformers
---

SentenceTransformer based on thenlper/gte-large

This is a sentence-transformers model finetuned from thenlper/gte-large. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: thenlper/gte-large
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity
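
These properties can be checked directly on the loaded model. A minimal sketch, assuming the model has already been downloaded from the Hub (repository id taken from the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames2")

print(model.max_seq_length)                      # 512
print(model.get_sentence_embedding_dimension())  # 1024
print(model.similarity_fn_name)                  # cosine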

Model Sources

  • Documentation: Sentence Transformers Documentation (https://sbert.net)
  • Repository: Sentence Transformers on GitHub (https://github.com/UKPLab/sentence-transformers)
  • Hugging Face: Sentence Transformers on Hugging Face (https://huggingface.co/models?library=sentence-transformers)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
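
The same three-module stack (BERT encoder, mean pooling, L2 normalization) can be rebuilt by hand from the base checkpoint. A rough sketch using the sentence-transformers modules API:

from sentence_transformers import SentenceTransformer, models

# Rebuild the architecture shown above from the base model.
word_embedding = models.Transformer("thenlper/gte-large", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])
print(model)  # should mirror the printout above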

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames2")
# Run inference
sentences = [
    'ADRIANA JOSEFINA GRATEROL DE GUTIERREZ',
    'GUSTAVO LONDONO GUTIERREZ',
    'CLAUDIA ROCIO RINCON SANCHEZ',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
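
For the name-matching task this model targets, pairwise similarity can be turned into a match decision with a score threshold. A small illustrative sketch: the pair is taken from the training samples below, and the 0.7 cutoff is an arbitrary example rather than a value reported by this card:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("JFernandoGRE/gtelarge-colombian-elitenames2")

# Two spellings that may refer to the same person
name_a = "JENIFERCAÑAVERAL QUINTERO"
name_b = "JENIFER MARIA CAÑAVERAL QUINTERO"

embeddings = model.encode([name_a, name_b])
score = model.similarity(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.3f}")

# 0.7 is an illustrative threshold; tune it on labeled pairs for your own data.
print("match" if score >= 0.7 else "no match")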

Training Details

Training Dataset

Unnamed Dataset

  • Size: 11,480 training samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
              sentence1          sentence2          label
    type      string             string             int
    details   min: 4 tokens      min: 4 tokens      0: ~82.40%
              mean: 8.22 tokens  mean: 8.74 tokens  1: ~17.60%
              max: 19 tokens     max: 15 tokens
  • Samples:
    sentence1                                  sentence2                           label
    ADELY ROMERO                               PETERMANJARREZ ROMERO               0
    JENIFERCAÑAVERAL QUINTERO                  JENIFER MARIA CAÑAVERAL QUINTERO    1
    ALBERTO ALBARRACIN VILLAMIZAR ESSENFELL    ANUAR ALBERTO PEREZ ESCAF           0
  • Loss: OnlineContrastiveLoss

Evaluation Dataset

Unnamed Dataset

  • Size: 2,870 evaluation samples
  • Columns: sentence1, sentence2, and label
  • Approximate statistics based on the first 1000 samples:
              sentence1          sentence2          label
    type      string             string             int
    details   min: 4 tokens      min: 4 tokens      0: ~82.60%
              mean: 8.25 tokens  mean: 8.65 tokens  1: ~17.40%
              max: 19 tokens     max: 15 tokens
  • Samples:
    sentence1                                                sentence2                               label
    PEDRO NEL SIERRA CARDONA E HIJOS S EN C EN LIQUIDACION   LIQUIDACION PEDRO NEL SIERRA CARDONA    1
    ALIKY LONDOÑO BOTERO                                     ELVIA CRISTINA LONDOÑO MENECES          0
    FERNANDO GUTIERREZ DE PIÑERES HERZBERG                   HER GUTIERREZ DE PIÑERES                1
  • Loss: OnlineContrastiveLoss
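
Both splits use OnlineContrastiveLoss, a contrastive loss that only backpropagates through the hard pairs in each batch: positive pairs (label 1) that are still far apart and negative pairs (label 0) that are still close. A minimal sketch of how the loss is constructed; the 0.5 margin is the library default, not a value stated in this card:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import OnlineContrastiveLoss

model = SentenceTransformer("thenlper/gte-large")
# margin=0.5 is the library default; this card does not report a custom margin.
loss = OnlineContrastiveLoss(model=model, margin=0.5)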

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 1e-05
  • num_train_epochs: 5
  • warmup_ratio: 0.182
  • fp16: True
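
Taken together, the run could be reproduced roughly as follows. This is a sketch under assumptions: the pair data comes from hypothetical local CSV files with the sentence1/sentence2/label columns described above, and only the non-default hyperparameters listed here are set explicitly:

from datasets import load_dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import OnlineContrastiveLoss

# Hypothetical local files with columns: sentence1, sentence2, label
data = load_dataset("csv", data_files={"train": "train_pairs.csv", "eval": "eval_pairs.csv"})

model = SentenceTransformer("thenlper/gte-large")
loss = OnlineContrastiveLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="gtelarge-colombian-elitenames2",  # hypothetical output path
    eval_strategy="steps",
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-5,
    num_train_epochs=5,
    warmup_ratio=0.182,
    fp16=True,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["eval"],
    loss=loss,
)
trainer.train()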

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 5
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.182
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss
0.1393 100 0.2368 0.2417
0.2786 200 0.1276 0.2186
0.4178 300 0.1381 0.1803
0.5571 400 0.1242 0.1682
0.6964 500 0.1113 0.1741
0.8357 600 0.1047 0.1321
0.9749 700 0.0906 0.1298
1.1142 800 0.0701 0.1270
1.2535 900 0.0702 0.1135
1.3928 1000 0.0807 0.0960
1.5320 1100 0.0632 0.0980
1.6713 1200 0.0666 0.0931
1.8106 1300 0.0773 0.0921
1.9499 1400 0.0738 0.0821
2.0891 1500 0.0585 0.0807
2.2284 1600 0.0359 0.0838
2.3677 1700 0.0509 0.0757
2.5070 1800 0.0393 0.0811
2.6462 1900 0.0437 0.0774
2.7855 2000 0.0258 0.0802
2.9248 2100 0.0437 0.0825
3.0641 2200 0.0297 0.0789
3.2033 2300 0.0308 0.0788
3.3426 2400 0.0411 0.0772
3.4819 2500 0.0322 0.0794
3.6212 2600 0.0268 0.0793
3.7604 2700 0.0360 0.0839
3.8997 2800 0.0330 0.0821
4.0390 2900 0.0299 0.0794
4.1783 3000 0.0226 0.0797
4.3175 3100 0.0198 0.0760
4.4568 3200 0.0293 0.0771
4.5961 3300 0.0274 0.0747
4.7354 3400 0.0162 0.0746
4.8747 3500 0.0351 0.0745

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.49.0
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1
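
To recreate this environment, the Python package versions above can be pinned directly. Note that PyTorch 2.6.0+cu124 is a CUDA 12.4 build and is typically installed from the PyTorch wheel index rather than plain PyPI:

pip install "sentence-transformers==3.4.1" "transformers==4.49.0" "accelerate==1.5.2" "datasets==3.4.1" "tokenizers==0.21.1"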

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}