MPNet base trained on sxc_med_llm_chemical_gen

This is a sentence-transformers model fine-tuned from microsoft/mpnet-base on the sxc_med_llm_chemical_gen dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: microsoft/mpnet-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Training Dataset: sxc_med_llm_chemical_gen
  • Language: en
  • License: apache-2.0

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
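
The Pooling module above has only pooling_mode_mean_tokens enabled, so a sentence embedding is the average of the token embeddings. The following is a minimal pure-Python sketch of that idea, assuming padding positions are excluded through an attention mask; the function name mean_pool and the toy vectors are illustrative and are not taken from the library.

```python
from typing import List


def mean_pool(token_embeddings: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the embeddings of non-padding tokens (mask value 1)."""
    if len(token_embeddings) != len(attention_mask):
        raise ValueError("one mask entry is required per token")
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in sums]


# Toy input: three real tokens and one padding token, 2-dimensional vectors.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```

In the real model each token vector has 768 entries, which is why the pooled sentence embedding is 768-dimensional.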

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Saideepthi55/sentencetransformer_ftmodel_on_chemical_dataset")
# Run inference
sentences = [
    'With a molecule represented by the SMILES string CNNNCC(=O)N[C@H](C)C[C@@H](C)NCc1ccc2c(c1)CCC2, propose adjustments that can increase its logP value while keeping the output molecule structurally related to the input molecule.',
    'Given a molecule expressed in SMILES string, help me optimize it according to my requirements.',
    'In line with your criteria, I\'ve optimized the molecule and present it as "C[C@H](C[C@@H](C)NC(=O)COC(C)(C)C)NCc1ccc2c(c1)CCC2".',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
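
The card lists cosine similarity as the similarity function, so the 3 x 3 result above holds one pairwise cosine score per sentence pair. As a rough illustration of what such a matrix contains, here is a pure-Python sketch on toy 2-dimensional vectors; cosine_similarity and similarity_matrix are hypothetical helper names, not library functions.

```python
import math
from typing import List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy stand-ins for sentence embeddings.
toy = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
matrix = similarity_matrix(toy)
for row in matrix:
    print([round(score, 4) for score in row])
```

The diagonal is 1 because every vector is compared with itself, and the matrix is symmetric.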

Evaluation

Metrics

Triplet

Metric Value
cosine_accuracy 0.9838
dot_accuracy 0.0162
manhattan_accuracy 0.9827
euclidean_accuracy 0.9836
max_accuracy 0.9838
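
A triplet accuracy is generally the fraction of (anchor, positive, negative) triplets in which the anchor is scored closer to the positive than to the negative. The sketch below illustrates that definition for cosine similarity on toy vectors. It is not the library's evaluator, the names are hypothetical, and it makes no attempt to reproduce or explain the values reported above.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def triplet_accuracy(triplets: List[Tuple[Vector, Vector, Vector]]) -> float:
    """Share of triplets where sim(anchor, positive) > sim(anchor, negative)."""
    correct = 0
    for anchor, positive, negative in triplets:
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative):
            correct += 1
    return correct / len(triplets)


# Toy triplets: the third one is ranked incorrectly on purpose.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),
    ([0.0, 1.0], [0.0, 2.0], [1.0, 0.0]),
    ([1.0, 1.0], [1.0, 0.0], [2.0, 2.0]),
    ([1.0, 0.0], [2.0, 0.0], [-1.0, 0.0]),
]
accuracy = triplet_accuracy(toy_triplets)
print(accuracy)
```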

Training Details

Training Dataset

sxc_med_llm_chemical_gen

  • Dataset: sxc_med_llm_chemical_gen at ae28874
  • Size: 117,502 training samples
  • Columns: input, instruction, and output
  • Approximate statistics based on the first 1000 samples:
    • input (string): min 2 tokens, mean 50.03 tokens, max 156 tokens
    • instruction (string): min 6 tokens, mean 16.19 tokens, max 69 tokens
    • output (string): min 3 tokens, mean 67.94 tokens, max 321 tokens
  • Samples:
    • Sample 1
      input: (empty)
      instruction: List the countries of the Middle East
      output: Egypt, Israel, Iran, Iraq, Jordan, Kuwait, Lebanon, Qatar, Saudi Arabia, Syria, Turkey, The United Arab Emirates, Oman, and Yemen.
    • Sample 2
      input: Support me in refining the molecule CCC(C)=C(CC)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1 to have a more drug-like nature, ensuring it remains close to the input molecule. Please inform me of the QED value of the optimized molecule.
      instruction: Help me adapt a molecular structure denoted in SMILES string based on my preferences.
      output: I've optimized the molecule according to your requirements, yielding the structure "COC(=O)c1ccc(Nc2nc3nonc3nc2N2CCC(Oc3ccc(Cl)cc3)CC2)cc1" and an estimated QED value of 0.4.
    • Sample 3
      input: (empty)
      instruction: Create a social media strategy to attract followers on Twitter.
      output: The social media strategy for Twitter should involve Tweeting on a regular basis, with content that is engaging and relevant to the target audience. This can include retweetable quotes, links to videos and articles, sharing photos/images, polls, and so on. It is also beneficial for the information to include hashtags relevant to the content and any relevant topics of the industry.

      It is also important to leverage Twitter influencers and accounts, as well as engaging with potential followers regularly and DM them when necessary. Additionally, running campaigns, contests and giveaways to gain more followers should also be part of the strategy. Finally, hashtag research should be conducted to increase the reach of the tweets.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
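
To make the two parameters above concrete: a loss of this kind scores every anchor against every positive in the batch with cosine similarity, multiplies the scores by the scale (20.0 here), and applies cross-entropy so that each anchor's own positive should win. The following is a simplified pure-Python sketch for (anchor, positive) pairs only, with hypothetical function names and toy vectors; it is not the library implementation and ignores any additional negative column.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cos_sim(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def in_batch_ranking_loss(anchors: List[Vector], positives: List[Vector], scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_sum_exp = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = in_batch_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

When every anchor matches its own positive the loss is close to zero; when the positives are swapped it approaches the scale value, which shows how the scale sharpens the penalty for a wrong ranking.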
    

Evaluation Dataset

sxc_med_llm_chemical_gen

  • Dataset: sxc_med_llm_chemical_gen at ae28874
  • Size: 4,500 evaluation samples
  • Columns: input, instruction, and output
  • Approximate statistics based on the first 1000 samples:
    • input (string): min 2 tokens, mean 68.49 tokens, max 165 tokens
    • instruction (string): min 6 tokens, mean 16.89 tokens, max 36 tokens
    • output (string): min 3 tokens, mean 69.16 tokens, max 205 tokens
  • Samples:
    • Sample 1
      input: With a molecule represented by the SMILES string COc1ccc(CN(C(=O)Nc2ccc(C@Hc3ccccc3)c(Cl)c2)C2CC2)cc1, propose adjustments that can decrease its blood-brain barrier permeability (BBBP) value by at least 0.1 compared to the pre-optimized value while keeping the output molecule structurally related to the input molecule. Please inform me of the BBBP value of the optimized molecule.
      instruction: Help me adapt a molecular structure denoted in SMILES string based on my preferences.
      output: I've optimized the molecule according to your requirements, yielding the structure "CC(C)c1cccc(Oc2ccc(CN(C(=O)Nc3ccc(C@Hc4ccccc4)c(Cl)c3)C3CC3)cc2)c1" and an estimated BBBP value of 0.34.
    • Sample 2
      input: Which modifications can be applied to the molecule Cc1nn(-c2cc3cc(F)ccc3s2)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12 to increase its water solubility value while ensuring it remains similar to the original molecule?
      instruction: Aid me in refining a molecular structure written in SMILES notation based on my criteria.
      output: I've optimized the molecule according to your requirements, resulting in the following structure: "Cc1nn(C)c2ncc(C(=O)N3CCCC3c3nnc4n3CCCCC4)cc12".
    • Sample 3
      input: (empty)
      instruction: Describe the benefits of having a mentor.
      output: Having a mentor can provide tremendous benefits, including guidance, advice, support and connection to a network of contacts. Mentors can offer valuable feedback, help to create goals and strategies, and provide the accountability necessary to carry out the desired goals. They can also provide a fresh perspective which can help to create new ideas and solutions.
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_ratio: 0.1
  • fp16: True
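
These settings imply a learning-rate curve that rises linearly during the first 10% of training steps and then decays linearly to zero (lr_scheduler_type is listed as linear under All Hyperparameters). The sketch below works out the step counts and the curve in plain Python. It assumes a single device, no gradient accumulation, and that the warmup step count is rounded up; the function name is hypothetical and this is not the trainer's own code.

```python
import math


def linear_schedule_lr(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


train_samples = 117_502  # training set size reported in this card
batch_size = 16          # per_device_train_batch_size
num_epochs = 1           # num_train_epochs
base_lr = 2e-05          # learning_rate
warmup_ratio = 0.1       # warmup_ratio

# Assumption: one device and no gradient accumulation, last partial batch kept.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * num_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step, total_steps, warmup_steps, base_lr))
```

Under these assumptions one epoch is 7,344 steps, which agrees with the Training Logs, where step 100 corresponds to epoch 0.0136.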

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: False
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • eval_use_gather_object: False
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch Step Training Loss Validation Loss sxc_med_llm_chemical_gen_max_accuracy
0 0 - - 0.7076
0.0136 100 4.1705 3.6314 0.7469
0.0272 200 3.0088 2.4771 0.8191
0.0408 300 2.3803 2.3765 0.8716
0.0545 400 2.2281 2.3122 0.9316
0.0681 500 2.1647 2.2997 0.9436
0.0817 600 2.1289 2.2663 0.9511
0.0953 700 2.0663 2.2601 0.9629
0.1089 800 2.065 2.2500 0.9687
0.1225 900 2.0399 2.2595 0.9693
0.1362 1000 1.9939 2.2375 0.9707
0.1498 1100 1.9858 2.2220 0.9684
0.1634 1200 2.0069 2.2265 0.9758
0.1770 1300 1.9591 2.2073 0.9702
0.1906 1400 1.9288 2.2078 0.976
0.2042 1500 1.9466 2.2036 0.9758
0.2179 1600 1.9175 2.2086 0.9764
0.2315 1700 1.8835 2.2329 0.9796
0.2451 1800 1.9134 2.2003 0.9796
0.2587 1900 1.8809 2.2003 0.9811
0.2723 2000 1.9263 2.2039 0.9824
0.2859 2100 1.9101 2.2084 0.9804
0.2996 2200 1.8846 2.2052 0.9831
0.3132 2300 1.8842 2.1903 0.9818
0.3268 2400 1.8945 2.1984 0.9807
0.3404 2500 1.9217 2.1859 0.9829
0.3540 2600 1.8704 2.1995 0.982
0.3676 2700 1.889 2.2038 0.9822
0.3813 2800 1.875 2.2079 0.9829
0.3949 2900 1.8792 2.1975 0.9833
0.4085 3000 1.882 2.1895 0.9796
0.4221 3100 1.8886 2.2115 0.9831
0.4357 3200 1.8629 2.2040 0.9838
0.4493 3300 1.8647 2.1973 0.9836
0.4630 3400 1.8888 2.1961 0.9838
0.4766 3500 1.8692 2.2027 0.9829
0.4902 3600 1.8846 2.1954 0.9838
0.5038 3700 1.8482 2.1888 0.9822
0.5174 3800 1.8527 2.1873 0.9824
0.5310 3900 1.8378 2.1940 0.9811
0.5447 4000 1.8679 2.2008 0.9833
0.5583 4100 1.8421 2.1845 0.9842
0.5719 4200 1.8325 2.1948 0.9847
0.5855 4300 1.8675 2.1750 0.9836
0.5991 4400 1.8483 2.1828 0.9831
0.6127 4500 1.854 2.1886 0.9831
0.6264 4600 1.827 2.1876 0.9824
0.6400 4700 1.8863 2.1849 0.9836
0.6536 4800 1.8919 2.1816 0.984
0.6672 4900 1.8211 2.1830 0.9847
0.6808 5000 1.8345 2.1847 0.9842
0.6944 5100 1.8685 2.1855 0.9853
0.7081 5200 1.85 2.1864 0.9844
0.7217 5300 1.8222 2.1875 0.9842
0.7353 5400 1.8179 2.1923 0.9844
0.7489 5500 1.7992 2.1909 0.9851
0.7625 5600 1.8495 2.1811 0.9847
0.7761 5700 1.808 2.1763 0.9842
0.7898 5800 1.8293 2.1861 0.9849
0.8034 5900 1.8184 2.1845 0.9851
0.8170 6000 1.8256 2.1956 0.9849
0.8306 6100 1.7904 2.1916 0.9842
0.8442 6200 1.8028 2.1918 0.9847
0.8578 6300 1.8316 2.1917 0.9856
0.8715 6400 1.7951 2.1929 0.9851
0.8851 6500 1.8175 2.1866 0.9847
0.8987 6600 1.8071 2.1899 0.9853
0.9123 6700 1.8632 2.1905 0.9844
0.9259 6800 1.8441 2.1885 0.984
0.9395 6900 1.8243 2.1865 0.9836
0.9532 7000 1.8055 2.1852 0.9842
0.9668 7100 1.8227 2.1843 0.984
0.9804 7200 1.8287 2.1831 0.984
0.9940 7300 1.8379 2.1838 0.9838

Framework Versions

  • Python: 3.10.12
  • Sentence Transformers: 3.2.0
  • Transformers: 4.44.2
  • PyTorch: 2.4.1+cu121
  • Accelerate: 0.34.2
  • Datasets: 3.0.1
  • Tokenizers: 0.19.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}