---
license: apache-2.0
metrics:
- accuracy
- bleu
pipeline_tag: text2text-generation
tags:
- chemistry
- biology
- medical
- smiles
- iupac
- text-generation-inference
widget:
- text: CCO
  example_title: ethanol
---
# SMILES2IUPAC-canonical-small
SMILES2IUPAC-canonical-small is designed to accurately translate SMILES representations of molecules into IUPAC names.
## Model Details
### Model Description
SMILES2IUPAC-canonical-small is based on the MT5 model, modified to use different tokenizers for the encoder and the decoder.
- **Developed by:** Knowledgator Engineering
- **Model type:** Encoder-Decoder with attention mechanism
- **Language(s) (NLP):** SMILES, IUPAC (English)
- **License:** Apache License 2.0
### Model Sources
- **Paper:** coming soon
- **Demo:** [ChemicalConverters](https://huggingface.co/spaces/knowledgator/ChemicalConverters)
## Quickstart
First, install the library:
```commandline
pip install chemical-converters
```
### SMILES to IUPAC
#### ! Preferred IUPAC style
To choose the preferred IUPAC style, place style tokens before
your SMILES sequence.
| Style Token | Description                                                                                  |
|-------------|----------------------------------------------------------------------------------------------|
| `<BASE>`    | The most common name of the substance; sometimes a mix of traditional and systematic styles  |
| `<SYST>`    | A fully systematic style without trivial names                                               |
| `<TRAD>`    | A style based on trivial names for parts of the substance                                    |
#### To perform a simple translation, follow the example:
```python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO'))
print(converter.smiles_to_iupac(['<SYST>CCO', '<TRAD>CCO', '<BASE>CCO']))
```
```text
['ethanol']
['ethanol', 'ethanol', 'ethanol']
```
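The style tokens are plain string prefixes, so prefixed inputs can be built with ordinary string handling. A minimal sketch, where `STYLE_TOKENS` and `with_style` are hypothetical helper names and not part of `chemical-converters`:

```python
# Hypothetical helper, not part of chemical-converters: it only builds
# the prefixed input strings described in the style-token table.
STYLE_TOKENS = ("<BASE>", "<SYST>", "<TRAD>")


def with_style(smiles: str, token: str = "<BASE>") -> str:
    """Return the SMILES string with a style token placed in front of it."""
    if token not in STYLE_TOKENS:
        raise ValueError(f"unknown style token: {token!r}")
    return f"{token}{smiles}"


# One input per style for the same molecule.
inputs = [with_style("CCO", token) for token in STYLE_TOKENS]
print(inputs)  # ['<BASE>CCO', '<SYST>CCO', '<TRAD>CCO']
```

The resulting list can be passed to `converter.smiles_to_iupac(...)` in the same way as the list in the example above.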
#### Processing in batches:
```python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac(["<BASE>C=CC=C" for _ in range(10)], num_beams=1,
                                process_in_batch=True, batch_size=1000))
```
```text
['buta-1,3-diene', 'buta-1,3-diene'...]
```
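To picture what `batch_size` controls, the sketch below splits a list of inputs into fixed-size chunks. It is a generic illustration that assumes batching means processing consecutive slices of the input list; `chunked` is a hypothetical helper, and the library's actual batching logic may differ.

```python
from typing import Iterator, List


def chunked(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


smiles = ["<BASE>C=CC=C" for _ in range(10)]
batches = list(chunked(smiles, batch_size=4))
print([len(batch) for batch in batches])  # [4, 4, 2]
```

With `batch_size=1000` and only 10 inputs, as in the example above, everything would fit in a single chunk.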
#### Validating SMILES to IUPAC translations
It's possible to validate a translation by translating the predicted IUPAC name back into SMILES
and calculating the Tanimoto similarity between the fingerprints of the two molecules.
````python
from chemicalconverters import NamesConverter
converter = NamesConverter(model_name="knowledgator/SMILES2IUPAC-canonical-small")
print(converter.smiles_to_iupac('CCO', validate=True))
````
````text
['ethanol'] 1.0
````
The higher the Tanimoto similarity, the more likely it is that the prediction is correct.
You can also run the validation manually:
```python
from chemicalconverters import NamesConverter
validation_model = NamesConverter(model_name="knowledgator/IUPAC2SMILES-canonical-base")
print(NamesConverter.validate_iupac(input_sequence='CCO', predicted_sequence='CCO', validation_model=validation_model))
```
```text
1.0
```
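For reference, the Tanimoto similarity of two fingerprints is the number of bits set in both divided by the number of bits set in either. A minimal sketch, assuming each fingerprint is represented as a set of on-bit indices; the example bit sets are made up for illustration, and the fingerprint type the library actually uses is not specified here:

```python
from typing import Set


def tanimoto(fp_a: Set[int], fp_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity of two fingerprints given as sets of on-bits."""
    union = fp_a | fp_b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints are identical.
        return 1.0
    return len(fp_a & fp_b) / len(union)


# Made-up on-bit indices, for illustration only.
original = {1, 5, 9, 12}
round_trip = {1, 5, 9, 20}

print(tanimoto(original, original))    # 1.0
print(tanimoto(original, round_trip))  # 0.6
```

A score of 1.0 means the two fingerprints are identical; lower scores mean the round-trip molecule differs from the input.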
## Bias, Risks, and Limitations
This model has limited accuracy on large molecules and currently does not support isomeric or isotopic SMILES.
### Training Procedure
The model was trained on 100M examples of SMILES-IUPAC pairs with lr=0.0003, batch_size=1024 for 2 epochs.
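
As a rough sense of scale, the figures above imply the following number of optimizer steps. This is back-of-the-envelope arithmetic that assumes exactly 100,000,000 examples, a batch size counted in examples, no gradient accumulation, and a final partial batch that is kept; none of these details are stated in this card.

```python
import math

num_examples = 100_000_000  # "100M examples", assumed exact
batch_size = 1024           # assumed to be counted in examples
epochs = 2

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 97657
print(total_steps)      # 195314
```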
## Evaluation
| Model                               | Accuracy | BLEU-4 score | Size (MB) |
|-------------------------------------|----------|--------------|-----------|
| SMILES2IUPAC-canonical-small        | 75%      | 0.93         | 23        |
| SMILES2IUPAC-canonical-base         | 86.9%    | 0.964        | 180       |
| STOUT V2.0\*                        | 66.65%   | 0.92         | 128       |
| STOUT V2.0 (according to our tests) |          | 0.89         | 128       |

\*According to the original paper: https://jcheminf.biomedcentral.com/articles/10.1186/s13321-021-00512-4
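
This card does not define how accuracy is computed. One common choice for this kind of task is exact string match between predicted and reference names; the sketch below illustrates that assumption only. `exact_match_accuracy` is a hypothetical helper, and the example names are illustrative inputs rather than evaluation data.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions that equal their reference after trimming whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Illustrative inputs, not taken from the evaluation set.
predicted = ["ethanol", "buta-1,3-diene", "propan-2-ol", "benzene"]
reference = ["ethanol", "buta-1,3-diene", "propan-1-ol", "benzene"]
print(exact_match_accuracy(predicted, reference))  # 0.75
```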
## Citation
Coming soon.
## Model Card Authors
[Mykhailo Shtopko](https://huggingface.co/BioMike)
## Model Card Contact
[email protected]