license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
SMILES2IUPAC-canonical-small
SMILES2IUPAC-canonical-small was designed to accurately translate SMILES chemical notation into IUPAC names.
Model Details
Model Description
SMILES2IUPAC-canonical-small is based on the mT5 model, modified to use separate tokenizers for the encoder and the decoder.
- Developed by: Knowledgator Engineering
- Model type: Encoder-Decoder with attention mechanism
- Language(s) (NLP): SMILES, IUPAC (English)
- License: Apache License 2.0
Model Sources
- Paper: coming soon
- Demo: ChemicalConverters
Quickstart
First, install the library:
pip install chemical-converters
SMILES to IUPAC
Preferred IUPAC style
To choose the preferred IUPAC style, place a style token directly before your SMILES sequence.
Style Token | Description |
---|---|
<BASE> | The most commonly used name of the substance; sometimes a mixture of the traditional and systematic styles |
<SYST> | A fully systematic style without trivial names |
<TRAD> | A style based on the trivial names of the parts of the substance |
To perform simple translation, follow the example:
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
['ethanol']
['ethanol', 'ethanol', 'ethanol']
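The styled inputs in the example above are ordinary strings with a token prepended, so they can also be built programmatically. A minimal sketch, assuming only the three tokens listed in the table; the helper name `with_style` is hypothetical and not part of chemical-converters:

```python
# Style tokens listed in the table above.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, style: str = "BASE") -> str:
    """Prefix a SMILES string with a style token (hypothetical helper)."""
    token = f"<{style.upper()}>"
    if token not in STYLE_TOKENS:
        raise ValueError(f"unknown style {style!r}; expected one of {STYLE_TOKENS}")
    return token + smiles


# Build one input per style for the same molecule.
styled = [with_style("CCO", s) for s in ("syst", "trad", "base")]
print(styled)  # ['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']
```

The resulting list can be passed to `converter.smiles_to_iupac` in the same way as the hand-written list above.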
Processing in batches:
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
process_in_batch=True, batch_size=1000))
['buta-1,3-diene', 'buta-1,3-diene'...]
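For intuition about what `batch_size` controls, batching generally means splitting the input list into consecutive chunks that are processed one at a time. The sketch below illustrates that idea in plain Python; it is not the library's implementation, and `chunked` is a hypothetical helper.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


smiles = ["<BASE>C=CC=C" for _ in range(10)]
batch_sizes = [len(batch) for batch in chunked(smiles, batch_size=4)]
print(batch_sizes)  # [4, 4, 2]
```

Under this scheme, 10 inputs with `batch_size=1000` would all fit in a single batch.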
Validating SMILES to IUPAC translations
You can validate a translation by translating the predicted IUPAC name back into SMILES and calculating the Tanimoto similarity between the fingerprints of the original and the reconstructed molecules.
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO', validate=True))
['ethanol'] 1.0
The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
You can also run the validation manually:
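For reference, the Tanimoto similarity of two fingerprints, viewed as sets of "on" bits A and B, is |A ∩ B| / |A ∪ B|. A minimal pure-Python sketch of that formula follows; the bit sets are made up for illustration, and the fingerprint type the library actually uses is not stated here.

```python
from typing import Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = a | b
    if not union:
        # Two empty fingerprints are treated as identical (a convention chosen for this sketch).
        return 1.0
    return len(a & b) / len(union)


# Made-up bit indices, for illustration only.
fp_original = {1, 5, 9, 12}
fp_roundtrip = {1, 5, 9, 14}

print(tanimoto(fp_original, fp_original))   # 1.0
print(tanimoto(fp_original, fp_roundtrip))  # 0.6
```

A score of 1.0 means the two fingerprints are identical, while lower scores mean the round-trip molecule differs from the input.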
from chemicalconverters import NamesConverter
validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='ethanol', validation_model=validation_model))
1.0
Bias, Risks, and Limitations
This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
Training Procedure
The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.0003, batch_size=1024 for 2 epochs.
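As a rough sanity check on these numbers, the implied number of optimizer steps can be worked out directly. The sketch assumes exactly 100,000,000 examples, that 1024 is the global batch size, and that no gradient accumulation was used; none of these is confirmed by the text above.

```python
import math

num_examples = 100_000_000  # "100M", taken as exact for this estimate (assumption)
batch_size = 1024           # assumed to be the global batch size
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 97657
print(total_steps)      # 195314
```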
Evaluation
Model | Accuracy | BLEU-4 score | Size(MB) |
---|---|---|---|
SMILES2IUPAC-canonical-small | 75% | 0.93 | 23 |
SMILES2IUPAC-canonical-base | 86.9% | 0.964 | 180 |
STOUT V2.0* | 66.65% | 0.92 | 128 |
STOUT V2.0 (according to our tests) | | 0.89 | 128 |

*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
Citation
Coming soon.