LChemME (base size) trained on ZINC22 fragments

LChemME pre-trained, using our LChemME Python package, on canonicalizing SMILES strings of molecules below 300 Da from ZINC22.

Model description

LChemME is a Large Chemical Model for Embedding based on BART, a transformer encoder-decoder architecture.

LChemME uses a small vocabulary (512 tokens) compared with natural language models. LChemME models are pretrained on the task of SMILES canonicalization (according to RDKit rules). This task requires the model to build an internal representation of the chemical graph directly from the SMILES string and decode that graph back to the canonical SMILES.
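For illustration, the canonicalization target can be reproduced directly with RDKit; the sketch below (using aspirin purely as an example input) shows how a non-canonical SMILES is rewritten in RDKit's canonical form.

from rdkit import Chem

# Parse a non-canonical SMILES for aspirin and write it back in RDKit's canonical form.
mol = Chem.MolFromSmiles("CC(Oc1ccccc1C(O)=O)=O")
canonical_smiles = Chem.MolToSmiles(mol)  # canonical output by default
print(canonical_smiles)  # expected: something like CC(=O)Oc1ccccc1C(=O)O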

This checkpoint results from pretraining on 1.2 million SMILES strings from ZINC22 with molecular weight less than 300 Da. The validation dataset comprised molecules with molecular weight greater than 350 Da. We aim for this LChemME model to assist with generalizing chemical property prediction from measurements on chemical fragments.
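As a rough sketch of how such a molecular-weight split could be reproduced (the file name, the `smiles` column, and the pandas workflow below are illustrative assumptions, not the exact pipeline used):

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def mol_weight(smiles: str) -> float:
    """Return the RDKit molecular weight, or NaN if the SMILES cannot be parsed."""
    mol = Chem.MolFromSmiles(smiles)
    return Descriptors.MolWt(mol) if mol is not None else float("nan")

df = pd.read_csv("zinc22_sample.csv")  # hypothetical file with a `smiles` column
df["mol_wt"] = df["smiles"].map(mol_weight)
train = df[df["mol_wt"] < 300.0]       # pretraining set: fragments below 300 Da
valid = df[df["mol_wt"] > 350.0]       # validation set: molecules above 350 Da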

How to use

Here is how to use this model in PyTorch:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('scbirlab/lchemme-base-zinc22-lteq300')
model = AutoModelForSeq2SeqLM.from_pretrained('scbirlab/lchemme-base-zinc22-lteq300')

# Tokenize a SMILES string (aspirin) and run a forward pass through the seq2seq model.
inputs = tokenizer("CC(Oc1ccccc1C(O)=O)=O", return_tensors="pt")
outputs = model(**inputs)

# The seq2seq output exposes the encoder representation as `encoder_last_hidden_state`.
last_hidden_states = outputs.encoder_last_hidden_state
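To reduce the variable-length encoder output to a single fixed-size embedding per molecule, one option (shown here only as a sketch; the pooling used by the LChemME package may differ) is to mean-pool the encoder hidden states over non-padding tokens. The decoder can also regenerate the canonical SMILES via generate:

import torch

# Mean-pool encoder states over real (non-padding) tokens to get one vector per molecule.
# Note: mean pooling is an illustrative choice, not necessarily what LChemME itself uses.
with torch.no_grad():
    encoder_out = model.get_encoder()(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
    )
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embedding = (encoder_out.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    # The decoder side can emit the canonical SMILES for the input molecule.
    generated = model.generate(**inputs, max_length=128)

print(tokenizer.batch_decode(generated, skip_special_tokens=True))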
Model size: 101M parameters (F32 tensors, Safetensors format)

Base model: facebook/bart-base (fine-tuned to produce this model)
