SSM_100M

SSM_100M is a state space model (SSM) developed with the Mamba framework for molecular generation. The model was trained using the code from https://github.com/Anri-Lombard/Mamba-SAFE. It was trained from scratch on the ZINC dataset, converted from SMILES to the SAFE (SMILES Augmented For Encoding) format. SSM_100M leverages state space models' efficiency and scalability to match the performance of transformer-based models like SAFE_100M while using fewer computational resources.

Evaluation Results

SSM_100M performs similarly to the transformer-based SAFE_100M model in molecular generation, maintaining high validity and diversity of generated molecules. It achieves these results with lower computational overhead, making it a more resource-efficient option for large-scale applications.
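This card does not define how validity and diversity are measured, so the following is only a minimal sketch under common generative-chemistry conventions: validity as the fraction of samples that parse, and uniqueness as the fraction of distinct strings among the valid ones. The function name `validity_and_uniqueness` and the `toy_is_valid` check are hypothetical stand-ins, not part of the Mamba-SAFE code. A real evaluation would pass a chemistry-aware parser as `is_valid` and compare canonicalized molecules rather than raw strings.

```python
from typing import Callable, Iterable, Tuple


def validity_and_uniqueness(
    samples: Iterable[str], is_valid: Callable[[str], bool]
) -> Tuple[float, float]:
    """Return (validity, uniqueness) for a batch of generated strings.

    validity   = valid samples / all samples
    uniqueness = distinct valid samples / valid samples
    These definitions are assumptions; the model card does not state them.
    """
    samples = list(samples)
    valid = [s for s in samples if is_valid(s)]
    validity = len(valid) / len(samples) if samples else 0.0
    uniqueness = len(set(valid)) / len(valid) if valid else 0.0
    return validity, uniqueness


def toy_is_valid(s: str) -> bool:
    # Placeholder check (non-empty, balanced parentheses). This is NOT a
    # chemical validity test; swap in a real molecule parser in practice.
    return bool(s) and s.count("(") == s.count(")")


generated = ["CCO", "CCO", "CC(C)O", "CC(C"]
validity, uniqueness = validity_and_uniqueness(generated, toy_is_valid)
print(f"validity={validity:.2f}, uniqueness={uniqueness:.2f}")
```

Passing the validator in as a callable keeps the metric code independent of any particular chemistry toolkit.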

Model Description

SSM_100M uses the Mamba framework's state space modeling to generate valid and diverse molecular structures efficiently. Because the ZINC training data was converted from SMILES to SAFE format, the model works with SAFE's fragment-based molecular encoding, which supports applications such as:

Drug Discovery: Identifying potential drug candidates with optimal properties.
Materials Science: Designing novel materials with targeted characteristics.
Chemical Engineering: Developing new chemical processes and compounds more efficiently.

Mamba Framework

The Mamba framework underpins SSM_100M, offering a robust architecture for linear-time sequence modeling with selective state spaces. It was introduced in the following paper:

@article{gu2023mamba,
  title={Mamba: Linear-time sequence modeling with selective state spaces},
  author={Gu, Albert and Dao, Tri},
  journal={arXiv preprint arXiv:2312.00752},
  year={2023}
}

We thank the authors for their contributions to sequence modeling.

SAFE Framework

SSM_100M represents molecules in the SAFE (SMILES Augmented For Encoding) format. The SAFE framework is detailed in the following publication:

@article{noutahi2024gotta,
  title={Gotta be SAFE: a new framework for molecular design},
  author={Noutahi, Emmanuel and Gabellini, Cristian and Craig, Michael and Lim, Jonathan SC and Tossou, Prudencio},
  journal={Digital Discovery},
  volume={3},
  number={4},
  pages={796--804},
  year={2024},
  publisher={Royal Society of Chemistry}
}

We appreciate the authors' invaluable work in molecular design.

Intended Uses & Limitations

Intended Uses

SSM_100M is suitable for:

Molecular Structure Generation: Creating new molecules with specific properties.
Chemical Space Exploration: Navigating the vast landscape of possible chemical compounds for research and development.
Material Design: Assisting in the creation of new materials with desired functionalities.

Limitations

Users should be aware of the following limitations:

Validation Required: Outputs should be validated by domain experts before use.
Synthetic Feasibility: Generated molecules may not always be synthesizable in the lab.
Dataset Boundaries: The model is limited to the chemical space of the ZINC dataset, which may restrict its applicability to novel or rare compounds outside this space.

Training and Evaluation Data

SSM_100M was trained on the ZINC dataset, a large collection of commercially available chemical compounds curated for virtual screening. The dataset was converted from SMILES to SAFE format before training so that the model learns from, and generates, SAFE strings.

Training Procedure

Training Hyperparameters

SSM_100M was trained with the following hyperparameters:

Learning Rate: 0.0003
Training Batch Size: 64
Evaluation Batch Size: 64
Random Seed: 42
Gradient Accumulation Steps: 4
Total Training Batch Size: 256
Optimizer: Adam (betas=(0.9, 0.98), epsilon=1e-09)
Learning Rate Scheduler: Cosine with 50,000 warmup steps
Total Training Steps: 300,000
Model Parameters: 100M
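The relationship between these values can be sketched in a few lines of plain Python. The total batch size of 256 follows from 64 samples per step times 4 accumulation steps, which assumes a single device. The schedule function below assumes linear warmup from zero followed by cosine decay to zero at step 300,000; the card only says "Cosine with 50,000 warmup steps", so the warmup shape and final value are assumptions, and the function names are illustrative rather than taken from the training code.

```python
import math

PEAK_LR = 3e-4          # Learning Rate: 0.0003
WARMUP_STEPS = 50_000   # warmup steps from the scheduler entry
TOTAL_STEPS = 300_000   # Total Training Steps
BATCH_SIZE = 64         # Training Batch Size
GRAD_ACCUM_STEPS = 4    # Gradient Accumulation Steps


def effective_batch_size(batch_size: int, accum_steps: int, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer update (single device assumed)."""
    return batch_size * accum_steps * num_devices


def cosine_lr_with_warmup(
    step: int,
    peak_lr: float = PEAK_LR,
    warmup_steps: int = WARMUP_STEPS,
    total_steps: int = TOTAL_STEPS,
) -> float:
    """Learning rate at an optimizer step.

    Assumed shape: linear warmup from 0 to peak_lr, then cosine decay to 0.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


print("effective batch size:", effective_batch_size(BATCH_SIZE, GRAD_ACCUM_STEPS))
for step in (0, 25_000, 50_000, 175_000, 300_000):
    print(f"step {step:>7}: lr = {cosine_lr_with_warmup(step):.6f}")
```

Under these assumptions the learning rate peaks at 0.0003 at step 50,000, is back to half that value at step 175,000 (the midpoint of the decay phase), and reaches zero at step 300,000.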

Framework Versions

Training used the following software versions:

Mamba: 1.2.3
PyTorch: 2.0.1
Datasets: 2.20.0
Tokenizers: 0.19.1

Acknowledgements

We thank the authors and contributors of the following frameworks and datasets:

Mamba Framework: For providing a solid foundation for state space modeling.
SAFE Framework: For improving molecular representation with innovative encoding techniques.
ZINC Dataset Authors: For curating a comprehensive dataset essential for training effective molecular generation models.

For more information and updates, visit the Mamba-SAFE repository at https://github.com/Anri-Lombard/Mamba-SAFE

References

@inproceedings{lombard2024molecular,
  title={Molecular Generation with State Space Sequence Models},
  author={Anri Lombard and Shane Acton and Ulrich Armel Mbou Sob and Jan Buys},
  booktitle={NeurIPS 2024 Workshop on AI for New Drug Modalities},
  year={2024},
  url={https://openreview.net/forum?id=1ib5oyTQIb}
}
