SentenceTransformer based on sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

This is a sentence-transformers model finetuned from sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. It maps sentences & paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
  • Number of Parameters: 118M (F32 safetensors)


Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
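Because the pooling module above uses mean pooling (pooling_mode_mean_tokens: True) and the model has no extra normalization layer, the embeddings can also be reproduced with the transformers library directly. A minimal sketch, assuming only the architecture shown above:

import torch
from transformers import AutoTokenizer, AutoModel

def mean_pooling(token_embeddings, attention_mask):
    # Average the token embeddings, masking out padding tokens.
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("fender2758/paraphrase-multilingual-MiniLM-L12-v2-winsearch")
model = AutoModel.from_pretrained("fender2758/paraphrase-multilingual-MiniLM-L12-v2-winsearch")

encoded = tokenizer(["This is an example sentence"], padding=True,
                    truncation=True, max_length=128, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

embeddings = mean_pooling(output.last_hidden_state, encoded["attention_mask"])
print(embeddings.shape)  # torch.Size([1, 384])

The result should match model.encode up to numerical precision.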

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("fender2758/paraphrase-multilingual-MiniLM-L12-v2-winsearch")
# Run inference
sentences = [
    '제로에서 사용한 드림캐쳐 음악 파일이네요',  # "This is the Dreamcatcher music file used in Zero"
    'LowpanInterface.java',
    '60418b56ab1a0123678f0375882bfb04.declarations_content',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 384]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
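Beyond pairwise similarity, the same embeddings support the semantic search use case mentioned above, e.g. ranking candidate file names against a natural-language query. A small sketch; the query and corpus strings here are made-up examples, not drawn from the training data:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("fender2758/paraphrase-multilingual-MiniLM-L12-v2-winsearch")

query = "드림캐쳐 음악 파일"   # hypothetical query: "Dreamcatcher music file"
corpus = [                      # hypothetical candidate file names
    "LowpanInterface.java",
    "dreamcatcher_theme.mp3",
    "usb_device_connect.png",
]

# Rank corpus entries by similarity to the query.
scores = model.similarity(model.encode([query]), model.encode(corpus))[0]
for name, score in sorted(zip(corpus, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {name}")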

Training Details

Training Dataset

Unnamed Dataset

  • Size: 2,295,920 training samples
  • Columns: sentence_0, sentence_1, and label
  • Approximate statistics based on the first 1000 samples:

                sentence_0            sentence_1            label
    type        string                string                float
    details     min: 2 tokens         min: 3 tokens         min: 0.0
                mean: 15.08 tokens    mean: 15.7 tokens     mean: 0.09
                max: 37 tokens        max: 37 tokens        max: 1.0
  • Samples:

    sentence_0: '언링크드'라는 시리즈 중 두번째 파일의 확장자가 잘린 버전
                (a version of the second file in the 'Unlinked' series with its extension cut off)
    sentence_1: WindowsSdk.java
    label:      0.0

    sentence_0: 고무고래를 본 모습의 이미지 파일이군요.
                (an image file showing a rubber whale)
    sentence_1: 1041a3b24d77e6d06b1415efe9b168ff.resolved
    label:      0.0

    sentence_0: 'usb 디바이스 연결' 이미지 파일의 시리즈 중 하나
                (one of a series of 'usb device connection' image files)
    sentence_1: df5d0ee3b8db0133f64348dc17cab1ec.unlinked2
    label:      0.0
  • Loss: CosineSimilarityLoss (sketched below) with these parameters:
    {
        "loss_fct": "torch.nn.modules.loss.MSELoss"
    }
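For reference, CosineSimilarityLoss computes the cosine similarity of each embedding pair and regresses it onto the label with the configured loss_fct (here MSELoss). A minimal sketch of the equivalent computation in plain PyTorch; the tensors are illustrative, not the library's internals:

import torch
import torch.nn.functional as F

# Illustrative batch of embedding pairs and target similarity labels.
emb_0 = torch.randn(4, 384)   # embeddings of sentence_0
emb_1 = torch.randn(4, 384)   # embeddings of sentence_1
labels = torch.tensor([0.0, 1.0, 0.0, 0.5])

cos_sim = F.cosine_similarity(emb_0, emb_1, dim=1)  # one score per pair
loss = F.mse_loss(cos_sim, labels)                  # loss_fct = MSELoss
print(loss)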
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 256
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: 2
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
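These settings map onto the Sentence Transformers 3.x trainer API. A minimal training sketch under that assumption; the dataset contents and output path below are placeholders, not the author's actual data or script:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import CosineSimilarityLoss

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Placeholder data with the card's columns: sentence_0, sentence_1, label.
train_dataset = Dataset.from_dict({
    "sentence_0": ["query text"],
    "sentence_1": ["candidate file name"],
    "label": [0.0],
})

args = SentenceTransformerTrainingArguments(
    output_dir="output",                       # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=256,
    per_device_eval_batch_size=256,
    multi_dataset_batch_sampler="round_robin",
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=CosineSimilarityLoss(model),
)
trainer.train()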

Training Logs

Epoch values are step / steps-per-epoch: with 2,295,920 training samples and a batch size of 256, one epoch is roughly 8,969 steps (e.g. step 500 ≈ epoch 0.0557).

Epoch Step Training Loss
0.0557 500 0.0572
0.1115 1000 0.0485
0.1672 1500 0.046
0.2230 2000 0.0439
0.2787 2500 0.0419
0.3345 3000 0.0413
0.3902 3500 0.0405
0.4460 4000 0.0405
0.5017 4500 0.0394
0.5575 5000 0.0391
0.6132 5500 0.0385
0.6690 6000 0.0382
0.7247 6500 0.0387
0.7805 7000 0.0373
0.8362 7500 0.037
0.8920 8000 0.037
0.9477 8500 0.0368
1.0035 9000 0.0372
1.0592 9500 0.036
1.1150 10000 0.0356
1.1707 10500 0.0353
1.2264 11000 0.0347
1.2822 11500 0.0335
1.3379 12000 0.0344
1.3937 12500 0.034
1.4494 13000 0.0342
1.5052 13500 0.0337
1.5609 14000 0.0338
1.6167 14500 0.0331
1.6724 15000 0.0333
1.7282 15500 0.0338
1.7839 16000 0.0332
1.8397 16500 0.0328
1.8954 17000 0.0331
1.9512 17500 0.0329
2.0069 18000 0.0334
2.0627 18500 0.0326
2.1184 19000 0.0325
2.1742 19500 0.0323
2.2299 20000 0.0317
2.2857 20500 0.0309
2.3414 21000 0.0317
2.3971 21500 0.0315
2.4529 22000 0.0315
2.5086 22500 0.0313
2.5644 23000 0.0317
2.6201 23500 0.0311
2.6759 24000 0.0313
2.7316 24500 0.0316
2.7874 25000 0.0313
2.8431 25500 0.031
2.8989 26000 0.0313
2.9546 26500 0.0311

Framework Versions

  • Python: 3.9.16
  • Sentence Transformers: 3.3.1
  • Transformers: 4.44.1
  • PyTorch: 1.12.1.post201
  • Accelerate: 0.33.0
  • Datasets: 3.1.0
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}