# GPT-2 for SMILES Unconditional Generation
This repository hosts a GPT-2-based model for unconditional generation of SMILES strings, pretrained on the ZINC 15 dataset. The model follows the architecture and hyperparameter setup of MolGPT (Bagal et al., 2021) and generates valid molecular representations at a high rate (see Model Performance below).
## Model Overview
### Architecture and Configuration
The model is built on the GPT-2 base architecture with the following configuration:

```python
from transformers import GPT2Config

config = GPT2Config(
    vocab_size=tokenizer.vocab_size,  # 10,000 tokens
    n_positions=128,
    n_ctx=128,
    n_embd=256,
    n_layer=8,
    n_head=8,
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
)
```
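For completeness, this configuration can be turned into a model object as sketched below. This step is not shown in the original card, which instead loads the pretrained checkpoint in the Usage section; the snippet only illustrates how the configuration maps to a randomly initialized GPT-2 language model.

```python
from transformers import GPT2LMHeadModel

# Build a randomly initialized GPT-2 LM from the configuration above.
# For inference, load the pretrained checkpoint shown in the Usage section instead.
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters():,}")
```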
The tokenizer was custom-trained with Byte Pair Encoding (BPE) and a vocabulary of 10,000 tokens. Below is the tokenizer configuration:

```python
from transformers import PreTrainedTokenizerFast

def configure_tokenizer(tokenizer_path):
    tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
    tokenizer.model_max_length = 128
    tokenizer.pad_token = "<pad>"
    tokenizer.bos_token = "<bos>"
    tokenizer.eos_token = "<eos>"
    return tokenizer
```
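The card does not include the tokenizer-training script. A minimal sketch with the Hugging Face `tokenizers` library is shown below; the training-file name and the exact special-token list are assumptions.

```python
from tokenizers import Tokenizer, models, trainers

# Hypothetical sketch: file name and special-token list are assumptions.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
trainer = trainers.BpeTrainer(
    vocab_size=10_000,
    special_tokens=["<pad>", "<bos>", "<eos>", "<unk>"],
)
tokenizer.train(files=["zinc15_smiles.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```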
### Training Details
- Pretraining Dataset: ZINC 15 (35,965,323 SMILES strings)
- Hardware: 8 NVIDIA RTX 2080 Ti GPUs
- Training Time: 16 hours
- Hyperparameters (wired into a `Trainer` in the sketch that follows):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir=output_dir,  # placeholder: path for checkpoints and logs
    evaluation_strategy="steps",
    learning_rate=5e-4,
    max_steps=100_000,
    per_device_train_batch_size=128,
    save_steps=10_000,
    save_total_limit=3,
    logging_dir=f"{output_dir}/logs",
    logging_steps=10_000,
    warmup_steps=10_000,
    dataloader_num_workers=4,
    gradient_accumulation_steps=1,
    fp16=True,
)
```
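The card does not show how these arguments are wired into a training loop. A minimal sketch using the Hugging Face `Trainer` is given below; the dataset variables are placeholders and the choice of data collator is an assumption.

```python
from transformers import Trainer, DataCollatorForLanguageModeling

# Causal-LM collator: labels are the input ids, shifted inside the model (mlm=False).
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

trainer = Trainer(
    model=model,
    args=training_args,           # the TrainingArguments defined above
    data_collator=data_collator,
    train_dataset=train_dataset,  # placeholder: tokenized ZINC 15 training split
    eval_dataset=eval_dataset,    # placeholder: tokenized validation split
)
trainer.train()
```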
## Model Performance
The validity of generated SMILES was evaluated by sampling 10,000 sequences at a fixed temperature of 1.0. For reference, the table also lists the validity reported for the original MolGPT model on the MOSES and GuacaMol benchmarks:
| Dataset / Metric  | GPT-2 (this model) | MolGPT |
|-------------------|--------------------|--------|
| ZINC 15 validity  | 99.68%             | N/A    |
| MOSES validity    | N/A                | 99.4%  |
| GuacaMol validity | N/A                | 98.1%  |
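The card does not state which toolkit was used for the validity check. A common choice is RDKit, where a SMILES string counts as valid if it parses into a molecule; a minimal sketch under that assumption:

```python
from rdkit import Chem

def validity(smiles_list):
    # A SMILES string is counted as valid if RDKit can parse it into a Mol object.
    valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
    return len(valid) / len(smiles_list)
```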
## Usage
### Install Dependencies

Install the required libraries via pip:

```bash
pip install transformers
pip install tokenizers
```
### Load the Model and Tokenizer

To use the model for SMILES generation, load the tokenizer and the pretrained weights:
```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer_path = "path_to_tokenizer/tokenizer.json"
tokenizer = PreTrainedTokenizerFast(tokenizer_file=tokenizer_path)
tokenizer.pad_token = "<pad>"
tokenizer.bos_token = "<bos>"
tokenizer.eos_token = "<eos>"
```
```python
from transformers import GPT2LMHeadModel

# Load model
model = GPT2LMHeadModel.from_pretrained("jonghyunlee/MolGPT_pretrained-by-ZINC15")
```
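Optionally (this step is not part of the original snippet), move the model to a GPU and switch to evaluation mode before sampling:

```python
import torch

# Optional: use a GPU if one is available and disable dropout for inference.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
```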
### Generate SMILES

```python
def generate_smiles(model, tokenizer, num_sequences=1000, temperature=1.0):
    # Unconditional sampling: generation starts from the <bos> token.
    return model.generate(
        max_length=128,
        num_return_sequences=num_sequences,
        pad_token_id=tokenizer.pad_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        temperature=temperature,
        return_dict_in_generate=True,
    )

outputs = generate_smiles(model, tokenizer, num_sequences=1000, temperature=1.0)
```
Decode the generated sequences (with `return_dict_in_generate=True`, the token ids are in `outputs.sequences`):

```python
generated_smiles = [tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs.sequences]
print(generated_smiles)
```
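Requesting thousands of sequences in a single `generate` call can exhaust GPU memory. Below is a hypothetical helper (not from the original card) that samples in smaller batches; the batch size is an assumption.

```python
def generate_in_batches(model, tokenizer, total=10_000, batch_size=500, temperature=1.0):
    # Sample in chunks of `batch_size` sequences until `total` SMILES are collected.
    smiles = []
    while len(smiles) < total:
        outputs = generate_smiles(model, tokenizer, num_sequences=batch_size, temperature=temperature)
        smiles.extend(
            tokenizer.decode(seq, skip_special_tokens=True) for seq in outputs.sequences
        )
    return smiles[:total]
```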
## Citation
If you use this model in your research, please cite:
Bagal, V., Aggarwal, R., Vinod, P. K., & Priyakumar, U. D. (2021). MolGPT: molecular generation using a transformer-decoder model. Journal of Chemical Information and Modeling, 62(9), 2064-2076.