LChemME (base size) trained on ZINC22 fragments
LChemME pre-trained with our LChemME Python package to canonicalize SMILES strings of molecules below 300 Da from ZINC22.
Model description
LChemME is a Large Chemical Model for Embedding based on BART, a transformer encoder-decoder architecture.
LChemME uses a small vocabulary (512 tokens) compared with natural-language models. LChemME models are pre-trained on the task of SMILES canonicalization (following RDKit rules), which requires the model to build an internal representation of the chemical graph directly from the input SMILES string and decode that graph back to a canonical SMILES.
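For reference, the RDKit canonicalization that defines the pre-training target can be reproduced in a few lines (a minimal sketch; the example SMILES variants are illustrative and not taken from the training data):

from rdkit import Chem

# Two different SMILES spellings of the same molecule (aspirin)
smiles_variants = ["CC(Oc1ccccc1C(O)=O)=O", "O=C(C)Oc1ccccc1C(=O)O"]

for smi in smiles_variants:
    mol = Chem.MolFromSmiles(smi)                  # parse SMILES into a molecular graph
    print(Chem.MolToSmiles(mol, canonical=True))   # both variants print the same canonical SMILES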
This checkpoint results from pre-training on 1.2 million SMILES strings from ZINC22 with molecular weight below 300 Da; the validation set comprised molecules with molecular weight above 350 Da (a split that can be reproduced as sketched below). We aim for this LChemME model to assist with generalizing chemical property prediction from measurements on chemical fragments.
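A weight-based train/validation split like this can be assigned with RDKit's Descriptors.MolWt (a hedged sketch; the exact filtering used for this checkpoint lives in the LChemME package and may differ):

from rdkit import Chem
from rdkit.Chem import Descriptors

def assign_split(smiles):
    """Assign a SMILES string to train (< 300 Da) or validation (> 350 Da) by molecular weight."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                      # skip unparseable SMILES
    mw = Descriptors.MolWt(mol)          # average molecular weight in Da
    if mw < 300.:
        return "train"
    if mw > 350.:
        return "validation"
    return None                          # molecules between 300 and 350 Da fall outside both sets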

How to use
Here is how to use this model in PyTorch:
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('scbirlab/lchemme-base-zinc22-lteq300')
model = AutoModelForSeq2SeqLM.from_pretrained('scbirlab/lchemme-base-zinc22-lteq300')

# Tokenize a SMILES string (aspirin) and run the seq2seq model
inputs = tokenizer("CC(Oc1ccccc1C(O)=O)=O", return_tensors="pt")
outputs = model(**inputs)

# Per-token embeddings come from the encoder's final layer
last_hidden_states = outputs.encoder_last_hidden_state
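Continuing from the snippet above, a fixed-length embedding per molecule can be obtained by mean-pooling the encoder hidden states over non-padding tokens (an illustrative approach; it is an assumption that this matches how the LChemME package computes embeddings):

import torch

smiles = ["CC(Oc1ccccc1C(O)=O)=O", "Oc1ccccc1"]   # example molecules
inputs = tokenizer(smiles, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Mask out padding tokens, then average the token embeddings for each molecule
mask = inputs["attention_mask"].unsqueeze(-1)               # (batch, seq_len, 1)
hidden = outputs.encoder_last_hidden_state                  # (batch, seq_len, dim)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (batch, dim)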
Base model: facebook/bart-base