Model documentation & parameters

Parameters

Algorithm Version

Which model checkpoint to use (trained on different datasets).

Task

Whether the multitask model should be used for property prediction or conditional generation (default).

Input

The input sequence. In the default setting (where Task is Generate and Sampling Wrapper is True), this can be a seed SMILES (for the molecule models) or an amino-acid sequence (for the protein models). The model locally adapts the seed sequence by masking a share of its tokens, set by Fraction to mask. If the Task is Predict, the sequences must be given as SELFIES for the molecule models, and the tokens that should be predicted ([MASK] in the input) must be given explicitly. Load the provided examples to see the expected format. NOTE: When Task is Generate and Sampling Wrapper is False, the user has maximal control over the generative process and can explicitly decide which tokens should be masked.
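
As a rough illustration of the explicit-masking input described above, the sketch below assembles input strings by hand. The layout (`<prop>`, then the value or `[MASK]` tokens, then a `|` separator, then the SELFIES) and the number of `[MASK]` tokens used for the property are assumptions made for illustration; the helper names are hypothetical and not part of any library. The examples shipped with the demo remain the reference for the exact format each model expects.

```python
import re
from typing import List, Set


def split_selfies(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies)


def build_predict_input(prop: str, selfies: str, n_mask: int = 5) -> str:
    """Assemble a Predict-style input: the property value is masked.

    ASSUMPTION: the '<prop>[MASK]...|SELFIES' layout and n_mask=5 are
    illustrative, not taken from the model's documentation.
    """
    return "<" + prop + ">" + "[MASK]" * n_mask + "|" + selfies


def mask_positions(selfies: str, positions: Set[int]) -> str:
    """Replace the tokens at the given indices with [MASK]."""
    tokens = split_selfies(selfies)
    return "".join(
        "[MASK]" if i in positions else tok for i, tok in enumerate(tokens)
    )


# Predict: mask the property, keep the molecule intact.
print(build_predict_input("qed", "[C][C][O]"))
# <qed>[MASK][MASK][MASK][MASK][MASK]|[C][C][O]

# Generate without the Sampling Wrapper: give the property, mask tokens by hand.
print("<qed>0.8|" + mask_positions("[C][C][O]", {1}))
# <qed>0.8|[C][MASK][O]
```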

Number of samples

How many samples should be generated (between 1 and 50). If Task is Predict, this has to be set to 1.

Search

Decoding search method. Use Sample if Task is Generate. If Task is Predict, use Greedy.
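
The constraints between Task, Number of samples and Search can be summarized in a small checker. This is only a sketch of the rules stated above; the function name and the message texts are hypothetical, and the demo itself may enforce these rules differently.

```python
from typing import List


def check_settings(task: str, number_of_samples: int, search: str) -> List[str]:
    """Return a list of problems with the chosen settings (empty if none)."""
    problems = []
    if task not in ("Generate", "Predict"):
        problems.append("Task must be 'Generate' or 'Predict'.")
    if not 1 <= number_of_samples <= 50:
        problems.append("Number of samples must be between 1 and 50.")
    if task == "Predict":
        if number_of_samples != 1:
            problems.append("Predict requires Number of samples = 1.")
        if search != "Greedy":
            problems.append("Predict should use Greedy search.")
    elif task == "Generate" and search != "Sample":
        problems.append("Generate should use Sample search.")
    return problems


print(check_settings("Generate", 10, "Sample"))  # []
print(check_settings("Predict", 5, "Sample"))    # two problems reported
```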

Tolerance

Precision tolerance; only used if Task is Generate. This is a single float between 0 and 100 specifying the tolerated deviation between the desired (primed) property and the predicted property of the generated molecule, given as a percentage of the property range encountered during training. NOTE: The tolerance is only used for post-hoc filtering of the generated samples.
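
To make the percentage concrete: with a tolerance of 10 and a property range of 0 to 1, a sample passes if its predicted property lies within 0.1 of the target. The sketch below implements that reading of the filter. The function names and the example range are hypothetical, and the actual filtering code may differ in detail (for example at the boundaries).

```python
from typing import List, Tuple


def within_tolerance(
    target: float,
    predicted: float,
    tolerance_pct: float,
    prop_min: float,
    prop_max: float,
) -> bool:
    """Check |predicted - target| against tolerance_pct percent of the range."""
    allowed = tolerance_pct / 100.0 * (prop_max - prop_min)
    return abs(predicted - target) <= allowed


def filter_samples(
    samples: List[Tuple[str, float]],
    target: float,
    tolerance_pct: float,
    prop_min: float,
    prop_max: float,
) -> List[Tuple[str, float]]:
    """Keep only (sequence, predicted_property) pairs within the tolerance."""
    return [
        (seq, pred)
        for seq, pred in samples
        if within_tolerance(target, pred, tolerance_pct, prop_min, prop_max)
    ]


# Hypothetical samples with predicted properties; range [0, 1] is assumed.
samples = [("mol_a", 0.75), ("mol_b", 0.95), ("mol_c", 0.88)]
print(filter_samples(samples, target=0.8, tolerance_pct=10, prop_min=0.0, prop_max=1.0))
# [('mol_a', 0.75), ('mol_c', 0.88)]
```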

Sampling Wrapper

Only used if Task is Generate. If set to False, the user has to provide a full RT-sequence as Input and has to explicitly decide which tokens are masked (see example below). This gives full control but is tedious. If Sampling Wrapper is instead set to True, the RT stochastically determines which parts of the sequence are masked. NOTE: All arguments below only apply if Sampling Wrapper is True.

Fraction to mask

Specifies the ratio of tokens that can be changed by the model. Argument only applies if Task is Generate and Sampling Wrapper is True.
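
A minimal sketch of what masking a fixed fraction of tokens could look like is given below. It assumes a SELFIES input and picks exactly `round(fraction * n_tokens)` positions uniformly at random; whether the Sampling Wrapper selects positions this way is an assumption, and the function name is hypothetical.

```python
import random
import re
from typing import Optional


def mask_fraction(selfies: str, fraction: float, seed: Optional[int] = None) -> str:
    """Mask round(fraction * n_tokens) randomly chosen SELFIES tokens."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    tokens = re.findall(r"\[[^\]]*\]", selfies)
    rng = random.Random(seed)  # seeded for reproducible illustration
    n_mask = round(fraction * len(tokens))
    chosen = set(rng.sample(range(len(tokens)), n_mask))
    return "".join(
        "[MASK]" if i in chosen else tok for i, tok in enumerate(tokens)
    )


# 8 tokens with fraction 0.25: two positions are replaced by [MASK].
print(mask_fraction("[C][C][O][C][=O][N][C][C]", 0.25, seed=0))
```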

Property goal

Specifies the desired target properties for the generation. Must be given in the format <prop>:value. If the model supports multiple properties, separate them by a comma (,). Argument only applies if Task is Generate and Sampling Wrapper is True.
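
The `<prop>:value` format can be parsed as sketched below. The parser is a hypothetical helper written for illustration, not the demo's own code; the property names in the example are taken from the ROP Catalyst model described further down.

```python
from typing import Dict


def parse_property_goal(text: str) -> Dict[str, float]:
    """Parse '<prop>:value' pairs separated by commas into a dict."""
    goals = {}
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        name, sep, value = item.partition(":")
        name = name.strip()
        if not sep or not (name.startswith("<") and name.endswith(">")):
            raise ValueError("Expected '<prop>:value', got: " + repr(item))
        goals[name] = float(value)  # raises ValueError on a non-numeric value
    return goals


print(parse_property_goal("<conv>:0.9, <pdi>:1.2"))
# {'<conv>': 0.9, '<pdi>': 1.2}
```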

Tokens to mask

Optionally specifies which tokens (atoms, bonds, etc.) can be masked. Separate multiple tokens by a comma (,). If not specified, all tokens can be masked. Argument only applies if Task is Generate and Sampling Wrapper is True.

Substructures to mask

Optionally specifies a list of substructures that should definitely be masked (i.e., they are always masked rather than left to stochastic masking). Given in SMILES format. If multiple are provided, separate them by a comma (,). Argument only applies if Task is Generate and Sampling Wrapper is True. NOTE: Most models operate on SELFIES, and substructure matching is done in SELFIES purely at the string level.

Substructures to keep

Optionally specifies a list of substructures that should definitely be present in the target sample (i.e., excluded from stochastic masking). Given in SMILES format. Argument only applies if Task is Generate and Sampling Wrapper is True. NOTE: This keeps tokens even if they are included in tokens_to_mask. NOTE: Most models operate on SELFIES, and substructure matching is done in SELFIES purely at the string level.
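
How the three masking options might interact is sketched below. The sketch assumes the substructures have already been converted from SMILES to SELFIES, and it matches them as contiguous token runs, which is one simple form of string-level matching. All names are hypothetical. Letting "keep" win over "mask" when both match the same token is an arbitrary choice made here, since the text above only states that kept tokens override tokens_to_mask.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple


def tokenize(selfies: str) -> List[str]:
    """Split a SELFIES string into its bracketed tokens."""
    return re.findall(r"\[[^\]]*\]", selfies)


def find_substructure(tokens: List[str], sub: List[str]) -> Set[int]:
    """Indices covered by any contiguous occurrence of sub in tokens."""
    hits: Set[int] = set()
    if not sub:
        return hits
    for start in range(len(tokens) - len(sub) + 1):
        if tokens[start:start + len(sub)] == sub:
            hits.update(range(start, start + len(sub)))
    return hits


def plan_masking(
    selfies: str,
    tokens_to_mask: Optional[Set[str]] = None,
    mask_subs: Iterable[str] = (),
    keep_subs: Iterable[str] = (),
) -> Tuple[List[int], List[int]]:
    """Return (always-masked positions, positions eligible for random masking)."""
    tokens = tokenize(selfies)
    keep: Set[int] = set()
    for sub in keep_subs:
        keep |= find_substructure(tokens, tokenize(sub))
    forced: Set[int] = set()
    for sub in mask_subs:
        forced |= find_substructure(tokens, tokenize(sub))
    forced -= keep  # ASSUMPTION: "keep" takes precedence over "mask"
    eligible = {
        i for i, tok in enumerate(tokens)
        if tokens_to_mask is None or tok in tokens_to_mask
    }
    return sorted(forced), sorted(eligible - keep - forced)


forced, optional = plan_masking(
    "[C][C][=O][O][C][N]",
    tokens_to_mask={"[C]", "[N]"},
    mask_subs=["[O]"],
    keep_subs=["[C][=O]"],
)
print(forced, optional)  # [3] [0, 4, 5]
```

In this example the `[C]` at index 1 is listed in tokens_to_mask but stays untouched because it belongs to the kept substructure `[C][=O]`.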

Model card -- Regression Transformer

Model Details: The Regression Transformer is a multitask Transformer that reformulates regression as a conditional sequence modeling task. This yields a dichotomous language model that seamlessly integrates property prediction with property-driven conditional generation.

Developers: Jannis Born and Matteo Manica from IBM Research.

Distributors: Original authors' code wrapped and distributed by GT4SD Team (2023) from IBM Research.

Model date: Preprint released in 2022; published in Nature Machine Intelligence in 2023 (see Citation).

Algorithm version: Models trained and distributed by the original authors.

  • Molecules: QED: Model trained on 1.6M molecules (SELFIES) from ChEMBL and their QED scores.
  • Molecules: Solubility: QED model finetuned on the ESOL dataset from Delaney (2004), J. Chem. Inf. Comput. Sci., to predict water solubility. Model trained on augmented SELFIES.
  • Molecules: Cosmo_acdl: Model finetuned on 56k molecules with two properties (pKa_ACDL and pKa_COSMO). Model used augmented SELFIES.
  • Molecules: Pfas: Model finetuned on ~1k PFAS (Perfluoroalkyl and Polyfluoroalkyl Substances) molecules with 9 properties including some experimentally measured ones (biodegradability, LD50 etc) and some synthetic ones (SCScore, molecular weight). Model trained on augmented SELFIES.
  • Molecules: Logp_and_synthesizability: Model trained on 2.9M molecules (SELFIES) from PubChem with two synthetic properties, the logP (partition coefficient) and the SCScore by Coley et al. (2018); J. Chem. Inf. Model.
  • Molecules: Crippen_logp: Model trained on 2.9M molecules (SMILES) from PubChem, but only on logP (partition coefficient).
  • Molecules: Reactions: USPTO: Model trained on 2.8M chemical reactions from the US patent office. The model used SELFIES and a synthetic property (total molecular weight of all precursors).
  • Molecules: Polymers: ROP Catalyst: Model finetuned on 600 ROPs (ring-opening polymerizations) with monomer-catalyst pairs. Model used three properties: conversion (<conv>), PDI (<pdi>) and molecular weight (<molwt>). Model trained with augmented SELFIES, optimized only to generate catalysts, given a monomer and the property constraints. Try the above UI example and see Park et al. (2022, ChemRxiv) for details.
  • Molecules: Polymers: Block copolymer: Model finetuned on ~1k block copolymers with a novel string representation developed for polymers. Model used two properties: dispersity (<Dispersity>) and MnGPC (<MnGPC>). This is the first generative model for block copolymers. Try the above UI example and see Park et al. (2022, ChemRxiv) for details.
  • Proteins: Stability: Model pretrained on 2.6M peptides from UniProt with the Boman index as property. Finetuned on the Stability dataset from the TAPE benchmark which has ~65k samples.

Model type: A Transformer-based language model that is trained on alphanumeric sequences to perform both sequence regression and conditional sequence generation.

Information about training algorithms, parameters, fairness constraints or other applied approaches, and features: All models are trained with an alternating training scheme that switches between optimizing the cross-entropy loss on the property tokens ("regression") and the self-consistency objective on the molecular tokens. See the Regression Transformer paper for details.
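
The alternation can be pictured as switching which part of the sequence is hidden from one training step to the next. The sketch below shows only this schedule and the resulting masked inputs; it does not implement either loss. The strict even/odd switching, the `|` separator, the one-mask-per-character treatment of the property value and the function name are all simplifying assumptions for illustration, not the authors' training code.

```python
import random
import re
from typing import Tuple


def masked_view(
    seq: str, step: int, text_fraction: float = 0.3, seed: int = 0
) -> Tuple[str, str]:
    """Return (objective name, masked sequence) for a given training step."""
    prop_part, _, text_part = seq.partition("|")
    if step % 2 == 0:
        # Property objective: hide the numeric value, keep the molecule.
        end = prop_part.index(">") + 1
        name, value = prop_part[:end], prop_part[end:]
        return "property", name + "[MASK]" * len(value) + "|" + text_part
    # Text objective: keep the property, hide part of the molecule.
    tokens = re.findall(r"\[[^\]]*\]", text_part)
    rng = random.Random(seed + step)
    n_mask = min(len(tokens), max(1, round(text_fraction * len(tokens))))
    chosen = set(rng.sample(range(len(tokens)), n_mask))
    masked = "".join(
        "[MASK]" if i in chosen else tok for i, tok in enumerate(tokens)
    )
    return "text", prop_part + "|" + masked


example = "<qed>0.8|[C][C][O][N]"
print(masked_view(example, step=0))
# ('property', '<qed>[MASK][MASK][MASK]|[C][C][O][N]')
print(masked_view(example, step=1))  # property kept, one molecular token masked
```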

Paper or other resource for more information: The Regression Transformer paper. See the source code for details.

License: MIT

Where to send questions or comments about the model: Open an issue on the GT4SD repository.

Intended Use. Use cases that were envisioned during development: Chemical research, in particular drug discovery.

Primary intended uses/users: Researchers and computational chemists using the model for model comparison or research exploration purposes.

Out-of-scope use cases: Production-level inference, producing molecules with harmful properties.

Factors: Not applicable.

Metrics: High predictive power for the properties of that specific algorithm version.

Datasets: Different ones, as described under Algorithm version.

Ethical Considerations: No specific considerations as no private/personal data is involved. Please consult with the authors in case of questions.

Caveats and Recommendations: Please consult the authors in case of questions.

Model card prototype inspired by Mitchell et al. (2019)

Citation

@article{born2023regression,
  title={Regression Transformer enables concurrent sequence regression and generation for molecular language modelling},
  author={Born, Jannis and Manica, Matteo},
  journal={Nature Machine Intelligence},
  volume={5},
  number={4},
  pages={432--444},
  year={2023},
  publisher={Nature Publishing Group UK London}
}