eduardosoares99
committed on
Update README.md
README.md
CHANGED
@@ -1,3 +1,136 @@
|
|
1 |
-
---
|
2 |
-
license: apache-2.0
|
3 |
-
---
|
1 |
+
---
|
2 |
+
license: apache-2.0
|
3 |
+
---
|
4 |
+
# SMILES-based Transformer Encoder-Decoder (SMI-TED)
|
5 |
+
|
6 |
+
This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".
|
7 |
+
|
8 |
+
Paper: [arXiv Link](paper/smi_ted_preprint.pdf)
|
9 |
+
|
10 |
+
For model weights, contact [email protected] or [email protected].
|
11 |
+
|
12 |
+
## Introduction
|
13 |
+
|
14 |
+
We present SMI-TED (SMILES-based Transformer Encoder-Decoder), a large encoder-decoder chemical foundation model pre-trained on a curated dataset of 91 million SMILES samples from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports a range of complex tasks, including quantum property prediction, and comes in two main variants ($289M$ and $8 \times 289M$). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance on a variety of tasks.
|
15 |
+
|
16 |
+
## Table of Contents
|
17 |
+
|
18 |
+
1. [Getting Started](#getting-started)
|
19 |
+
    1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
|
20 |
+
    2. [Replicating Conda Environment](#replicating-conda-environment)
|
21 |
+
2. [Pretraining](#pretraining)
|
22 |
+
3. [Finetuning](#finetuning)
|
23 |
+
4. [Feature Extraction](#feature-extraction)
|
24 |
+
5. [Citations](#citations)
|
25 |
+
|
26 |
+
## Getting Started
|
27 |
+
|
28 |
+
**This code and environment have been tested on NVIDIA V100 and A100 GPUs.**
|
29 |
+
|
30 |
+
### Pretrained Models and Training Logs
|
31 |
+
|
32 |
+
We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.
|
33 |
+
|
34 |
+
Place the SMI-TED pre-trained weights file (`.pt`) in the `inference/` or `finetune/` directory, depending on your use case. The directory structure should look like this:
|
35 |
+
|
36 |
+
```
|
37 |
+
inference/
|
38 |
+
├── smi_ted_light
|
39 |
+
│   ├── smi_ted_light.pt
|
40 |
+
│   ├── bert_vocab_curated.txt
|
41 |
+
│   └── load.py
|
42 |
+
```
|
43 |
+
and/or:
|
44 |
+
|
45 |
+
```
|
46 |
+
finetune/
|
47 |
+
├── smi_ted_light
|
48 |
+
│   ├── smi_ted_light.pt
|
49 |
+
│   ├── bert_vocab_curated.txt
|
50 |
+
│   └── load.py
|
51 |
+
```
|
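Before running inference or finetuning, it can help to confirm that the files landed where the tree above expects them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the file names are simply copied from the directory listing above.

```python
from pathlib import Path

# File names copied from the directory tree above.
EXPECTED_FILES = ("smi_ted_light.pt", "bert_vocab_curated.txt", "load.py")


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir.

    Hypothetical helper for illustration; it is not shipped with SMI-TED.
    """
    model_dir = Path(model_dir)
    return [name for name in expected if not (model_dir / name).is_file()]


if __name__ == "__main__":
    # Check both locations mentioned above; only one of them may be in use.
    for root in ("inference", "finetune"):
        missing = missing_files(Path(root) / "smi_ted_light")
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{root}/smi_ted_light -> {status}")
```

Run it from the repository root; a directory you do not use will simply report all three files as missing.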
52 |
+
|
53 |
+
### Replicating Conda Environment
|
54 |
+
|
55 |
+
Follow these steps to replicate our Conda environment and install the necessary libraries:
|
56 |
+
|
57 |
+
#### Create and Activate Conda Environment
|
58 |
+
|
59 |
+
```
|
60 |
+
conda create --name smi-ted-env python=3.8.18
|
61 |
+
conda activate smi-ted-env
|
62 |
+
```
|
63 |
+
|
64 |
+
#### Install Packages with Conda
|
65 |
+
|
66 |
+
```
|
67 |
+
conda install pytorch=1.13.1 cudatoolkit=11.4 -c pytorch
|
68 |
+
conda install numpy=1.23.5 pandas=2.0.3
|
69 |
+
conda install rdkit=2021.03.5 -c conda-forge
|
70 |
+
```
|
71 |
+
|
72 |
+
#### Install Packages with Pip
|
73 |
+
|
74 |
+
```
|
75 |
+
pip install transformers==4.6.0 pytorch-fast-transformers==0.4.0 torch-optimizer==0.3.0 datasets==1.6.2 scikit-learn==1.3.2 scipy==1.12.0 tqdm==4.66.1
|
76 |
+
```
|
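After installation, the pinned versions can be compared against what is actually present in the environment. The sketch below uses only the standard library (`importlib.metadata`); the helper names `installed_version` and `check_pins` are hypothetical, and the pins are copied from the commands above. PyTorch, RDKit, and `pytorch-fast-transformers` are left out because the distribution names they report can differ from the names used at install time.

```python
from importlib import metadata

# Pins copied from the conda and pip commands above.
PINNED = {
    "numpy": "1.23.5",
    "pandas": "2.0.3",
    "transformers": "4.6.0",
    "torch-optimizer": "0.3.0",
    "datasets": "1.6.2",
    "scikit-learn": "1.3.2",
    "scipy": "1.12.0",
    "tqdm": "4.66.1",
}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins=PINNED):
    """Return {name: (expected, found)} for every package that misses its pin."""
    mismatches = {}
    for name, expected in pins.items():
        found = installed_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins().items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means every listed package matches its pin; a `found` value of `None` means the package is not installed in the active environment.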
77 |
+
|
78 |
+
## Pretraining
|
79 |
+
|
80 |
+
For pretraining, we use two strategies: masked language modeling to train the encoder, and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
|
81 |
+
|
82 |
+
SMI-TED is pre-trained on 91M canonicalized and curated SMILES from PubChem with the following constraints:
|
83 |
+
|
84 |
+
- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
|
85 |
+
- A 95/5/0 split is used for encoder training, with 5% of the data for decoder pretraining.
|
86 |
+
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.
|
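The length filter and the 95/5 partition described above can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's preprocessing code: the regular expression is a commonly used SMILES tokenization pattern and may differ from the tokenizer SMI-TED actually uses, whether the 202-token limit counts special tokens is not stated here, and the function names are hypothetical.

```python
import random
import re

# Commonly used SMILES tokenization pattern (an assumption for illustration;
# the tokenizer shipped with SMI-TED may split differently).
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

MAX_TOKENS = 202  # maximum length stated in the list above


def tokenize(smiles):
    """Split a SMILES string into tokens with the assumed pattern."""
    return SMILES_TOKEN.findall(smiles)


def filter_by_length(smiles_list, max_tokens=MAX_TOKENS):
    """Keep only the SMILES whose token count does not exceed max_tokens."""
    return [s for s in smiles_list if len(tokenize(s)) <= max_tokens]


def split_95_5(smiles_list, seed=0):
    """Shuffle deterministically and return (95% part, 5% part)."""
    shuffled = list(smiles_list)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 95 // 100  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    data = filter_by_length(["CCO", "c1ccccc1", "CC(=O)O"])
    major, minor = split_95_5(data)
    print(len(major), len(minor))
```

A fixed seed keeps the partition reproducible across runs, which matters when the two parts feed different training stages.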
87 |
+
|
88 |
+
The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.
|
89 |
+
|
90 |
+
To pre-train the two variants of the SMI-TED model, run:
|
91 |
+
|
92 |
+
```
|
93 |
+
bash training/run_model_light_training.sh
|
94 |
+
```
|
95 |
+
or
|
96 |
+
```
|
97 |
+
bash training/run_model_large_training.sh
|
98 |
+
```
|
99 |
+
|
100 |
+
Use `train_model_D.py` to train only the decoder or `train_model_ED.py` to train both the encoder and decoder.
|
101 |
+
|
102 |
+
## Finetuning
|
103 |
+
|
104 |
+
The finetuning datasets and environment can be found in the [finetune](finetune/) directory. After setting up the environment, you can run a finetuning task with:
|
105 |
+
|
106 |
+
```
|
107 |
+
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
|
108 |
+
```
|
109 |
+
|
110 |
+
Finetuning checkpoints and training outputs are saved in directories named `checkpoint_<measure_name>`.
|
111 |
+
|
112 |
+
## Feature Extraction
|
113 |
+
|
114 |
+
The example notebook [smi_ted_encoder_decoder_example.ipynb](notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.
|
115 |
+
|
116 |
+
To load SMI-TED, use:
|
117 |
+
|
118 |
+
```python
|
119 |
+
model = load_smi_ted(
|
120 |
+
    folder='../inference/smi_ted_light',
|
121 |
+
    ckpt_filename='smi_ted_light.pt'
|
122 |
+
)
|
123 |
+
```
|
124 |
+
|
125 |
+
To encode SMILES into embeddings, you can use:
|
126 |
+
|
127 |
+
```python
|
128 |
+
with torch.no_grad():
|
129 |
+
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
|
130 |
+
```
|
131 |
+
To decode embeddings back into SMILES strings, use:
|
132 |
+
|
133 |
+
```python
|
134 |
+
with torch.no_grad():
|
135 |
+
    decoded_smiles = model.decode(encoded_embeddings)
|
136 |
+
```
|
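One simple way to sanity-check the encode/decode round trip is to measure how often the decoded string equals the input. The sketch below assumes `decoded_smiles` is a sequence of strings aligned one-to-one with the input SMILES, which is not confirmed here; the helper name `reconstruction_rate` is hypothetical. Exact string matching is a strict criterion: two different SMILES strings can describe the same molecule, so canonicalizing both sides first (for example with RDKit) gives a fairer comparison.

```python
def reconstruction_rate(original, decoded):
    """Fraction of inputs whose decoded SMILES matches the original exactly.

    Hypothetical helper; both arguments are sequences of strings of equal
    length, aligned by position. Surrounding whitespace is ignored.
    """
    original, decoded = list(original), list(decoded)
    if len(original) != len(decoded):
        raise ValueError("original and decoded must have the same length")
    if not original:
        return 0.0
    matches = sum(a.strip() == b.strip() for a, b in zip(original, decoded))
    return matches / len(original)


if __name__ == "__main__":
    # Toy strings standing in for df['SMILES'] and decoded_smiles.
    inputs = ["CCO", "c1ccccc1"]
    outputs = ["CCO", "C1=CC=CC=C1"]  # same molecule, different string
    print(reconstruction_rate(inputs, outputs))
```

In the toy example the second pair denotes the same molecule (benzene) but counts as a mismatch, which is exactly the case canonicalization would resolve.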