Model Card for Model hogru/MolReactGen-GuacaMol-Molecules

MolReactGen is a model that generates molecules in SMILES format (this model) and reaction templates in SMARTS format.

Model Details

Model Description

MolReactGen is based on the GPT-2 transformer decoder architecture and has been trained on the GuacaMol dataset. More information can be found in the introductory slides (see Presentation under Model Sources).

Developed by: Stephan Holzgruber
Model type: Transformer decoder
License: MIT

Model Sources

Repository: https://github.com/hogru/MolReactGen
Presentation: https://github.com/hogru/MolReactGen/blob/main/presentations/Slides%20(A4%20size).pdf
Poster: https://github.com/hogru/MolReactGen/blob/main/presentations/Poster%20(A0%20size).pdf

Uses

The main use of this model is to pass the master's examination of the author ;-)

Direct Use

The model can be used in a Hugging Face text generation pipeline. The intended use case needs a wrapper around the raw text generation pipeline; this wrapper is generate.py in the repository. The model ships with a default GenerationConfig() (generation_config.json), which can be overridden. Depending on the number of molecules to be generated (num_return_sequences in the JSON file), generation might take a while. generate.py shows a progress bar during generation.
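For orientation, the raw pipeline call (without the generate.py wrapper) might look like the minimal sketch below. It rests on several assumptions that the card does not confirm: that the tokenizer defines a beginning-of-sequence token usable as the prompt, that sampling is wanted, and that 128 new tokens is a sensible cap (a hypothetical value). The helper names generation_overrides and generate_raw are illustrative, not part of the repository.

```python
MODEL_ID = "hogru/MolReactGen-GuacaMol-Molecules"


def generation_overrides(num_molecules: int, max_new_tokens: int = 128) -> dict:
    """Build keyword arguments that override the default generation config.

    max_new_tokens=128 is a hypothetical cap, not a value taken from the card.
    """
    if num_molecules < 1:
        raise ValueError("num_molecules must be at least 1")
    return {
        # Sampling is required to return several distinct sequences
        # without beam search.
        "do_sample": True,
        "num_return_sequences": num_molecules,
        "max_new_tokens": max_new_tokens,
    }


def generate_raw(num_molecules: int = 5) -> list[str]:
    """Return raw generated strings from the text generation pipeline."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=MODEL_ID)
    # Assumption: the tokenizer defines a BOS token that starts a molecule.
    prompt = generator.tokenizer.bos_token
    outputs = generator(prompt, **generation_overrides(num_molecules))
    return [item["generated_text"] for item in outputs]
```

The raw strings may still contain special tokens or incomplete molecules; generate.py from the repository remains the intended entry point because it handles that post-processing.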

Bias, Risks, and Limitations

The model generates molecules that are similar to the GuacaMol training data, which itself is based on ChEMBL. Any checks of the molecules, e.g. chemical feasibility, must be addressed by the user of the model.
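One basic check a user might run is whether each generated string parses as a molecule at all. The sketch below assumes RDKit is used for parsing, which the card does not state; filter_parsable is a hypothetical helper name. Parsing only confirms that a string is valid SMILES and says nothing about synthetic accessibility or other aspects of chemical feasibility.

```python
from typing import Callable, Iterable, Optional


def filter_parsable(
    smiles: Iterable[str],
    parse: Optional[Callable[[str], object]] = None,
) -> list[str]:
    """Keep only the strings that the parser accepts.

    `parse` must return None for strings it cannot parse. If omitted,
    RDKit's Chem.MolFromSmiles is used (an assumption, not from the card).
    """
    if parse is None:
        # Imported lazily so a custom parser can be used without RDKit.
        from rdkit import Chem

        parse = Chem.MolFromSmiles
    # Empty strings are dropped explicitly: RDKit parses "" to an empty
    # molecule instead of returning None.
    return [s for s in smiles if s and parse(s) is not None]
```

With RDKit installed, filter_parsable(generated_strings) returns the subset that parses; a different parser can be supplied through the parse argument.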

Training Details

Training Data

GuacaMol dataset

Training Procedure

The default Hugging Face Trainer() has been used, with an EarlyStoppingCallback().
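To illustrate what early stopping does, here is a simplified, library-free sketch of a patience rule: training stops once the evaluation loss has failed to improve for a given number of consecutive evaluations. It is not the implementation of EarlyStoppingCallback(), and the patience and threshold values are hypothetical since the card does not list them.

```python
from typing import Sequence


def should_stop(
    eval_losses: Sequence[float],
    patience: int,
    threshold: float = 0.0,
) -> bool:
    """Return True if the patience rule would have stopped training.

    An evaluation counts as an improvement only if the loss drops below
    the best value so far by more than `threshold`.
    """
    best = None
    stale_evaluations = 0
    for loss in eval_losses:
        if best is None or (best - loss) > threshold:
            best = loss
            stale_evaluations = 0
        else:
            stale_evaluations += 1
        if stale_evaluations >= patience:
            return True
    return False
```

For example, with patience=2 the loss sequence 1.0, 0.9, 0.95, 0.93 triggers a stop after the fourth evaluation, because the last two evaluations did not beat 0.9.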

Preprocessing

The training data was pre-processed with a PreTrainedTokenizerFast() that was itself trained on the training data. It uses a character-level pre-tokenizer and Unigram as the sub-word tokenization algorithm, with a vocabulary size of 88. Other tokenizers can be configured.
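A tokenizer of this kind could be built with the Hugging Face tokenizers library roughly as sketched below. The regex-based split, the "<unk>" special token, and the function names are assumptions for illustration; the repository's own tokenizer code may differ.

```python
from typing import Iterable


def split_into_characters(smiles: str) -> list[str]:
    """Pure-Python view of character-level pre-tokenization."""
    return list(smiles)


def train_unigram_tokenizer(smiles_corpus: Iterable[str], vocab_size: int = 88):
    """Train a Unigram tokenizer on top of a character-level pre-tokenizer.

    The "<unk>" token is a hypothetical choice, not taken from the card.
    """
    # Imported lazily so split_into_characters works without these packages.
    from tokenizers import Regex, Tokenizer, models, pre_tokenizers, trainers
    from transformers import PreTrainedTokenizerFast

    tokenizer = Tokenizer(models.Unigram())
    # Isolate every single character as its own pre-token.
    tokenizer.pre_tokenizer = pre_tokenizers.Split(Regex("."), behavior="isolated")
    trainer = trainers.UnigramTrainer(
        vocab_size=vocab_size,
        special_tokens=["<unk>"],
        unk_token="<unk>",
    )
    tokenizer.train_from_iterator(smiles_corpus, trainer=trainer)
    return PreTrainedTokenizerFast(tokenizer_object=tokenizer, unk_token="<unk>")
```

Note that a character-level split breaks two-letter atoms apart, so split_into_characters("CC(=O)Cl") yields C, C, (, =, O, ), C, l.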

Training Hyperparameters

Batch size: 64
Gradient accumulation steps: 4
Mixed precision: fp16, native amp
Learning rate: 0.0025
Learning rate scheduler: Cosine
Learning rate scheduler warmup: 0.1
Optimizer: AdamW with betas=(0.9,0.95) and epsilon=1e-08
Number of epochs: 50

More configuration options can be found in the conf directory of the repository.
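As a rough guide, the hyperparameters listed above could map onto Hugging Face TrainingArguments fields as in the sketch below. Two mappings are assumptions: that "Batch size" means the per-device batch size, and that the warmup value of 0.1 is a ratio of total training steps. The names TRAINING_KWARGS, effective_batch_size and make_training_arguments are illustrative only.

```python
TRAINING_KWARGS = {
    "per_device_train_batch_size": 64,  # assumption: batch size is per device
    "gradient_accumulation_steps": 4,
    "fp16": True,                       # mixed precision, needs a CUDA GPU
    "learning_rate": 0.0025,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.1,                # assumption: 0.1 is a ratio, not steps
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_epsilon": 1e-8,
    "num_train_epochs": 50,
}


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Number of samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def make_training_arguments(output_dir: str):
    """Build TrainingArguments from the values above."""
    # Imported lazily so the dictionary can be inspected without transformers.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

On the single GPU described under Hardware, a batch size of 64 with 4 gradient accumulation steps gives an effective batch size of 256 samples per optimizer step.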

Evaluation

Please see the slides / the poster mentioned above.

Metrics

Please see the slides / the poster mentioned above.

Results

Please see the slides / the poster mentioned above.

Technical Specifications

Framework versions

Transformers 4.27.1
PyTorch 1.13.1
Datasets 2.10.1
Tokenizers 0.13.2

Hardware

Local PC running Ubuntu 22.04
NVIDIA GeForce RTX 3080 Ti (12 GB)