language:
  - en
license: apache-2.0
library_name: sentence-transformers
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:161
  - loss:MatryoshkaLoss
  - loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-base-en-v1.5
datasets: []
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
widget:
  - source_sentence: >-
      As per Part II of the PDPA, Personal Data Protection Commission (PDPC) is
      the

      regulatory body to enforce the provisions of PDPA. The PDPC is empowered
      with

      broad discretion to issue remedial directions, initiate investigation

      inquiries, and impose fines and penalties on the organisations in case of
      any

      non-compliance of PDPA.


      1


      If organisations misuse the personal data or hide information concerning
      its

      collection, use, or disclosure, PDPA states penalties not exceeding
      **S$50,000

      (approx. $36,000)**.


      2


      Penalty for hindering a PDPC investigation can lead to a fine of not more
      than

      **S$100,000 (approx. $72,000)**. The PDPA states that companies are also

      liable for their employees’ actions, whether they are aware of them or
      not.


      3


      New amendments to PDPA have enforced increased financial penalties for

      breaches of the PDPA up to **10%** of annual gross turnover in Singapore,
      or

      **S$ 1 million** , whichever is higher.


      4


      Non-compliance with specific provisions under the PDPA may also constitute
      an

      offense, for which a fine or a term of **imprisonment** may be imposed.


      5


      An individual can bring a private civil action against an organisation for

      having suffered **loss or damage** directly due to a contravention of the

      provisions of the PDPA.
    sentences:
      - What is the right to notice under the CCPA?
      - What are the risks of non-compliance with the PDPA?
      - What is the definition of personal data under the PDP Law?
  - source_sentence: >-
      The DPA requires all data controllers to take appropriate technical and
      organisational measures that are necessary to protect data from
      unauthorised destruction, negligent loss, unauthorised alteration or
      access and any other unauthorised processing of the data.
    sentences:
      - Which regulatory authority enforces GDPR in France?
      - What are the security requirements under the DPA?
      - How do PIPEDA and GDPR differ?
  - source_sentence: >-
      if the data controller or the data processor holds a valid registration
      certificate authorizing him or her to store personal data outside Rwanda
    sentences:
      - What is the difference between GDPR and a Data Protection Act?
      - What is the voluntary certification by the CPPA?
      - Where is personal data storage outside of Rwanda permitted?
  - source_sentence: >-
      The PDP law will regulate sensitive personal data as well as other
      personal data that may endanger or harm the privacy of the data subject.
    sentences:
      - What is the material scope of the PDP Law?
      - >-
        What is the definition of personal information under the DPA in the
        Philippines?
      - What does Securiti offer to help with data privacy compliance?
  - source_sentence: >-
      Thailand's PDPA applies to any legal entity collecting, using, or
      disclosing a natural (and alive) person's personal data.
    sentences:
      - Who does the Thailand's PDPA apply to?
      - >-
        What penalties could an organization face for infringing Kenya's Data
        Protection Act?
      - What is the CPRA?
pipeline_tag: sentence-similarity
model-index:
  - name: SentenceTransformer based on BAAI/bge-base-en-v1.5
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 768
          type: dim_768
        metrics:
          - type: cosine_accuracy@1
            value: 0.5555555555555556
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8333333333333334
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8888888888888888
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.5555555555555556
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.27777777777777773
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17777777777777778
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.5555555555555556
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8333333333333334
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8888888888888888
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7730002998303461
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7011463844797178
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.7011463844797178
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 512
          type: dim_512
        metrics:
          - type: cosine_accuracy@1
            value: 0.5
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8333333333333334
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8888888888888888
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.5
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.27777777777777773
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.17777777777777778
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.5
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8333333333333334
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8888888888888888
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.753767166905132
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6746913580246914
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6746913580246914
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 256
          type: dim_256
        metrics:
          - type: cosine_accuracy@1
            value: 0.5
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8888888888888888
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9444444444444444
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.5
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2962962962962962
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1888888888888889
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.5
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8888888888888888
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.9444444444444444
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7698314695487533
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6939814814814815
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6939814814814815
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 128
          type: dim_128
        metrics:
          - type: cosine_accuracy@1
            value: 0.5
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8333333333333334
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8888888888888888
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9444444444444444
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.5
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.27777777777777773
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.1777777777777778
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.09444444444444446
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.5
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.8333333333333334
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8888888888888888
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.9444444444444444
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7436864067552591
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6774691358024691
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6799943883277217
            name: Cosine Map@100
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: dim 64
          type: dim_64
        metrics:
          - type: cosine_accuracy@1
            value: 0.4444444444444444
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.6666666666666666
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.8333333333333334
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 1
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.4444444444444444
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.2222222222222222
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.16666666666666669
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.10000000000000002
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.4444444444444444
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.6666666666666666
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.8333333333333334
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 1
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.7007609579807462
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.6075617283950616
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.6075617283950616
            name: Cosine Map@100

SentenceTransformer based on BAAI/bge-base-en-v1.5

This is a sentence-transformers model finetuned from BAAI/bge-base-en-v1.5. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-base-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
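
The last two modules in the architecture above can be sketched in plain Python: the Pooling module is configured with `pooling_mode_cls_token: True`, so the sentence embedding is the first ([CLS]) token vector, which `Normalize()` then scales to unit length. The toy 4-dimensional token vectors and the helper names `cls_pool` and `l2_normalize` are illustrative assumptions, not part of the library.

```python
import math

def cls_pool(token_embeddings):
    """Return the vector of the first ([CLS]) token of one sequence."""
    return list(token_embeddings[0])

def l2_normalize(vector):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]

# Toy stand-in for the transformer output: 3 tokens x 4 dims
# (the real model produces 768 dims per token).
token_embeddings = [
    [3.0, 4.0, 0.0, 0.0],  # [CLS]
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 2.0, 0.0, 2.0],
]

sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)  # [0.6, 0.8, 0.0, 0.0]
```

Because the output is already unit length, cosine similarity between two embeddings reduces to a dot product.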

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("MugheesAwan11/bge-base-securiti-dataset-1-v6")
# Run inference
sentences = [
    "Thailand's PDPA applies to any legal entity collecting, using, or disclosing a natural (and alive) person's personal data.",
    "Who does the Thailand's PDPA apply to?",
    "What penalties could an organization face for infringing Kenya's Data Protection Act?",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
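
Since the model was trained with MatryoshkaLoss and evaluated at 768, 512, 256, 128 and 64 dimensions, embeddings can in principle be shortened by keeping only the leading dimensions and re-normalizing. The following is a minimal sketch of that step on toy 8-dimensional vectors; `truncate_and_normalize` and `cosine` are hypothetical helpers written for this example, and the vectors are made up.

```python
import math

def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale them to unit L2 norm."""
    head = list(embedding[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for two 768-dimensional embeddings.
full_a = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
full_b = [0.5, 0.5, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0]

for dim in (8, 4, 2):
    a = truncate_and_normalize(full_a, dim)
    b = truncate_and_normalize(full_b, dim)
    print(dim, round(cosine(a, b), 4))
```

Recent Sentence Transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which should give the same effect without manual slicing; check this against the installed version before relying on it.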

Evaluation

Metrics

Information Retrieval

  • Dataset: dim_768

Metric Value
cosine_accuracy@1 0.5556
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 1.0
cosine_precision@1 0.5556
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.1
cosine_recall@1 0.5556
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 1.0
cosine_ndcg@10 0.773
cosine_mrr@10 0.7011
cosine_map@100 0.7011

Information Retrieval

  • Dataset: dim_512

Metric Value
cosine_accuracy@1 0.5
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 1.0
cosine_precision@1 0.5
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.1
cosine_recall@1 0.5
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 1.0
cosine_ndcg@10 0.7538
cosine_mrr@10 0.6747
cosine_map@100 0.6747

Information Retrieval

  • Dataset: dim_256

Metric Value
cosine_accuracy@1 0.5
cosine_accuracy@3 0.8889
cosine_accuracy@5 0.9444
cosine_accuracy@10 1.0
cosine_precision@1 0.5
cosine_precision@3 0.2963
cosine_precision@5 0.1889
cosine_precision@10 0.1
cosine_recall@1 0.5
cosine_recall@3 0.8889
cosine_recall@5 0.9444
cosine_recall@10 1.0
cosine_ndcg@10 0.7698
cosine_mrr@10 0.694
cosine_map@100 0.694

Information Retrieval

  • Dataset: dim_128

Metric Value
cosine_accuracy@1 0.5
cosine_accuracy@3 0.8333
cosine_accuracy@5 0.8889
cosine_accuracy@10 0.9444
cosine_precision@1 0.5
cosine_precision@3 0.2778
cosine_precision@5 0.1778
cosine_precision@10 0.0944
cosine_recall@1 0.5
cosine_recall@3 0.8333
cosine_recall@5 0.8889
cosine_recall@10 0.9444
cosine_ndcg@10 0.7437
cosine_mrr@10 0.6775
cosine_map@100 0.68

Information Retrieval

  • Dataset: dim_64

Metric Value
cosine_accuracy@1 0.4444
cosine_accuracy@3 0.6667
cosine_accuracy@5 0.8333
cosine_accuracy@10 1.0
cosine_precision@1 0.4444
cosine_precision@3 0.2222
cosine_precision@5 0.1667
cosine_precision@10 0.1
cosine_recall@1 0.4444
cosine_recall@3 0.6667
cosine_recall@5 0.8333
cosine_recall@10 1.0
cosine_ndcg@10 0.7008
cosine_mrr@10 0.6076
cosine_map@100 0.6076
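
The relationships between the metrics above are easier to see in code. In every table recall@k equals accuracy@k and precision@k equals accuracy@k divided by k, which is what happens when each query has exactly one relevant passage. The fractions (for example 0.5556 = 10/18) are also consistent with an evaluation set of 18 queries, although the card does not state this. The sketch below computes the metrics from the rank of the relevant passage per query; the `ranks` list is hypothetical and chosen only so that the accuracy values match the 768-dimension table, so its MRR and NDCG are not expected to match.

```python
import math

def retrieval_metrics(ranks, k):
    """Metrics for queries with exactly one relevant document each.

    ranks: 1-based rank of the relevant document per query (None if missing).
    """
    n = len(ranks)
    hits = [r is not None and r <= k for r in ranks]
    accuracy = sum(hits) / n
    return {
        "accuracy": accuracy,
        "precision": accuracy / k,  # at most one relevant doc among k results
        "recall": accuracy,         # one relevant doc in total per query
        "mrr": sum(1.0 / r for r, h in zip(ranks, hits) if h) / n,
        # Ideal DCG is 1 with a single relevant doc, so NDCG = 1 / log2(rank + 1).
        "ndcg": sum(1.0 / math.log2(r + 1) for r, h in zip(ranks, hits) if h) / n,
    }

# Hypothetical ranks for 18 queries (assumption, not taken from the model card).
ranks = [1] * 10 + [2] * 3 + [3] * 2 + [4] + [7, 9]

for k in (1, 3, 5, 10):
    m = retrieval_metrics(ranks, k)
    print(k, round(m["accuracy"], 4), round(m["precision"], 4), round(m["recall"], 4))
```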

Training Details

Training Dataset

Unnamed Dataset

  • Size: 161 training samples
  • Columns: positive and anchor
  • Approximate statistics based on the first 1000 samples:
    |         | positive | anchor |
    |---------|----------|--------|
    | type    | string   | string |
    | details | min: 5 tokens; mean: 40.09 tokens; max: 481 tokens | min: 7 tokens; mean: 13.01 tokens; max: 24 tokens |
  • Samples:
    | positive | anchor |
    |----------|--------|
    | The DPA may impose administrative fines of up to €10 million, or up to 2% of worldwide turnover. The DPA may also impose heavier fines up to €20 million, or up to 4% of worldwide turnover. | What is the penalty for non-compliance with the GDPR in Italy? |
    | As per the DPA, the data handler must seek consent in writing from the data subject to collect any sensitive personal data. | What are the consent requirements under the DPA? |
    | China's cybersecurity laws include the Cybersecurity Law, which governs various aspects of cybersecurity, data protection, and the obligations of organizations to ensure the security of networks and data within China's territory. | What are the cybersecurity laws in China? |
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128,
            64
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
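
To make the parameters above concrete, here is a rough pure-Python sketch of how the two losses combine: MultipleNegativesRankingLoss treats the other positives in the batch as negatives for each anchor, and MatryoshkaLoss sums that loss over embeddings truncated to each entry of `matryoshka_dims`, weighted by `matryoshka_weights`. This is not the library implementation; the function names, the toy 4-dimensional vectors, and the similarity scale of 20 (assumed here to mirror the library default) are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over in-batch candidates; the correct class for anchor i is positive i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def matryoshka_loss(anchors, positives, dims, weights):
    """Weighted sum of the base loss over truncated embeddings."""
    total = 0.0
    for dim, weight in zip(dims, weights):
        total += weight * mnr_loss([a[:dim] for a in anchors],
                                   [p[:dim] for p in positives])
    return total

# Toy batch of two (anchor, positive) pairs in 4 dims; the card uses
# dims [768, 512, 256, 128, 64] with all weights equal to 1.
anchors = [[1.0, 0.0, 0.2, 0.0], [0.0, 1.0, 0.0, 0.2]]
positives = [[0.9, 0.1, 0.1, 0.0], [0.1, 0.9, 0.0, 0.1]]

matched = matryoshka_loss(anchors, positives, dims=[4, 2], weights=[1, 1])
swapped = matryoshka_loss(anchors, positives[::-1], dims=[4, 2], weights=[1, 1])
print(matched < swapped)  # True: correctly paired batches give a lower loss
```

With `n_dims_per_step` set to -1, every listed dimension contributes to the loss at each training step, which is what the loop over `dims` models.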
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: epoch
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • gradient_accumulation_steps: 2
  • learning_rate: 2e-05
  • num_train_epochs: 4
  • lr_scheduler_type: cosine
  • warmup_ratio: 0.1
  • bf16: True
  • tf32: True
  • load_best_model_at_end: True
  • optim: adamw_torch_fused
  • batch_sampler: no_duplicates
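
The hyperparameters above imply a very short run, which a little arithmetic makes explicit. The sketch below assumes a single device and that the trainer rounds batches up and optimizer steps down; under those assumptions it reproduces the 3 steps per epoch and 12 total steps seen in the training logs. The `lr_at` function is a hand-written approximation of a linear-warmup cosine schedule, not a call into the Transformers library.

```python
import math

num_samples = 161      # training set size from the model card
batch_size = 32        # per_device_train_batch_size (single device assumed)
grad_accum = 2         # gradient_accumulation_steps
epochs = 4             # num_train_epochs
warmup_ratio = 0.1
base_lr = 2e-05

batches_per_epoch = math.ceil(num_samples / batch_size)      # 6
steps_per_epoch = max(batches_per_epoch // grad_accum, 1)    # 3
total_steps = steps_per_epoch * epochs                       # 12
warmup_steps = math.ceil(total_steps * warmup_ratio)         # 2

def lr_at(step):
    """Learning rate after `step` optimizer steps: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(batches_per_epoch, steps_per_epoch, total_steps, warmup_steps)
for step in range(total_steps + 1):
    print(step, f"{lr_at(step):.2e}")
```

With only 12 optimizer steps, each at an effective batch size of up to 64 pairs, the learning rate spends most of the run on the decaying part of the cosine curve.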

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: epoch
  • prediction_loss_only: True
  • per_device_train_batch_size: 32
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 2
  • eval_accumulation_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 4
  • max_steps: -1
  • lr_scheduler_type: cosine
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: True
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional

Training Logs

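| Epoch | Step | Training Loss | dim_128_cosine_map@100 | dim_256_cosine_map@100 | dim_512_cosine_map@100 | dim_64_cosine_map@100 | dim_768_cosine_map@100 |
|-------|------|---------------|------------------------|------------------------|------------------------|-----------------------|------------------------|
| 1.0 | 3 | - | 0.6555 | 0.6686 | 0.6395 | 0.5554 | 0.6469 |
| 2.0 | 6 | - | 0.6701 | 0.6821 | 0.6701 | 0.5910 | 0.6951 |
| 3.0 | 9 | - | 0.6706 | 0.6940 | 0.6701 | 0.6076 | 0.7025 |
| 3.3333 | 10 | 5.2757 | - | - | - | - | - |
| **4.0** | **12** | **-** | **0.68** | **0.694** | **0.6747** | **0.6076** | **0.7011** |

  • The bold row denotes the saved checkpoint; its values match the metrics reported in the Evaluation section.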

Framework Versions

  • Python: 3.10.14
  • Sentence Transformers: 3.0.1
  • Transformers: 4.41.2
  • PyTorch: 2.1.2+cu121
  • Accelerate: 0.31.0
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning}, 
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply}, 
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}