---
datasets:
- sagawa/ZINC-canonicalized
library_name: transformers
tags:
- safe
- datamol-io
- molecule-design
- smiles
- generated_from_trainer
model-index:
- name: SAFE_100M
  results: []
---

# SAFE_100M

SAFE_100M is a transformer-based model for molecular generation tasks. It was trained from scratch on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized) converted to the SAFE (Sequential Attachment-based Fragment Embedding) format, and reaches a loss of **0.3887** on its evaluation set. The model is intended to generate valid and diverse molecular structures.

## Table of Contents

- [Model Description](#model-description)
- [Intended Uses & Limitations](#intended-uses--limitations)
- [Training and Evaluation Data](#training-and-evaluation-data)
- [Training Procedure](#training-procedure)
  - [Training Hyperparameters](#training-hyperparameters)
  - [Framework Versions](#framework-versions)
- [Acknowledgements](#acknowledgements)
- [References](#references)

## Model Description

SAFE_100M builds on the [SAFE framework](#references), which represents molecules in the Sequential Attachment-based Fragment Embedding format, a fragment-oriented line notation derived from SMILES. Trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), the model is suited to exploring chemical space in applications such as:

- **Drug Discovery**
- **Materials Science**
- **Chemical Engineering**

The transformer architecture is used to generate molecules that are both valid and structurally diverse.

## Intended Uses & Limitations

### Intended Uses

SAFE_100M is designed to support:

- **Molecular Structure Generation**: Creating novel molecules with desired properties.
- **Chemical Space Exploration**: Identifying potential candidates for drug development.
- **Material Design Assistance**: Proposing new materials with specific characteristics.

### Limitations

While SAFE_100M is a powerful tool, users should be aware of the following limitations:

- **Validation Requirement**: Outputs should be reviewed by domain experts before practical application.
- **Synthetic Feasibility**: Generated molecules may not always be synthesizable in a laboratory setting.
- **Dataset Boundaries**: The model's knowledge is confined to the chemical space represented in the ZINC dataset.

## Training and Evaluation Data

The model was trained on the [ZINC dataset](https://huggingface.co/datasets/sagawa/ZINC-canonicalized), a large repository of commercially available chemical compounds optimized for virtual screening. This dataset was transformed into the SAFE format to enhance molecular encoding for machine learning applications.

## Training Procedure

### Training Hyperparameters

SAFE_100M was trained with the following hyperparameters:

- **Learning Rate**: `0.0001`
- **Training Batch Size**: `100`
- **Evaluation Batch Size**: `100`
- **Random Seed**: `42`
- **Gradient Accumulation Steps**: `2`
- **Total Training Batch Size**: `200`
- **Optimizer**: Adam (`betas=(0.9, 0.999)`, `epsilon=1e-08`)
- **Learning Rate Scheduler**: Linear with `10,000` warmup steps
- **Total Training Steps**: `250,000`

### Framework Versions

The training process utilized the following software frameworks:

- **Transformers**: `4.44.2`
- **PyTorch**: `2.4.0+cu121`
- **Datasets**: `2.20.0`
- **Tokenizers**: `0.19.1`

## Acknowledgements

We extend our gratitude to the authors of the SAFE framework for their significant contributions to the field of molecular design.

## References

```bibtex
@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}
```
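
## Appendix: Learning Rate Schedule Sketch

The values listed under Training Hyperparameters imply two derived quantities: the total training batch size and the learning rate at any given step. The sketch below reproduces both in plain Python. It is not the training script used for this model. It assumes a single device (consistent with `100 × 2 = 200`) and that "Linear" means a linear ramp from zero over the warmup steps followed by a linear decay to zero at the final step. The function names `effective_batch_size` and `linear_schedule_lr` are hypothetical helpers introduced only for illustration.

```python
# Values copied from the Training Hyperparameters list.
PEAK_LR = 1e-4
TRAIN_BATCH_SIZE = 100
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 10_000
TOTAL_STEPS = 250_000


def effective_batch_size(per_step: int, accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update (num_devices=1 is an assumption)."""
    return per_step * accum_steps * num_devices


def linear_schedule_lr(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at `step` under linear warmup then linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


if __name__ == "__main__":
    print("total batch size:", effective_batch_size(TRAIN_BATCH_SIZE, GRAD_ACCUM_STEPS))
    for step in (0, 5_000, 10_000, 130_000, 250_000):
        print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at `0.0001` exactly when warmup ends (step `10,000`) and has fallen to half of that by step `130,000`, the midpoint of the decay phase.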