ChemSolubilityBERTa

Model Description

ChemSolubilityBERTa is a prototype model that predicts the aqueous solubility of chemical compounds from their SMILES representations. It is based on ChemBERTa, a BERT-like transformer architecture pre-trained on 77M SMILES strings for molecular property prediction. We adapted ChemBERTa to solubility prediction by fine-tuning it on the ESOL (Estimated SOLubility) dataset, a water solubility dataset of 1,128 compounds. Given a SMILES string as input, the model outputs a log solubility value (log mol/L). You can read the full paper here.

Fine-Tuning Details

  • Pretrained model: seyonec/ChemBERTa-zinc-base-v1
  • Dataset: ESOL (delaney-processed)
  • Task: Aqueous solubility prediction (log mol/L)
  • Number of training epochs: 3
  • Batch size: 16
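For reference, a minimal fine-tuning sketch consistent with the settings above is shown below. It assumes a local copy of the standard MoleculeNet delaney-processed.csv file and its column names ("smiles", "measured log solubility in mols per litre"); the file path and column names are assumptions for illustration, not artifacts shipped with this model.

import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class ESOLDataset(Dataset):
    """Wraps tokenized SMILES strings and their measured log solubilities."""
    def __init__(self, smiles, targets, tokenizer):
        self.encodings = tokenizer(list(smiles), truncation=True, padding=True)
        self.targets = list(targets)

    def __len__(self):
        return len(self.targets)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.targets[idx], dtype=torch.float)
        return item

df = pd.read_csv("delaney-processed.csv")  # assumed local copy of the ESOL dataset
tokenizer = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "seyonec/ChemBERTa-zinc-base-v1", num_labels=1)  # num_labels=1 -> regression head

train_dataset = ESOLDataset(df["smiles"],
                            df["measured log solubility in mols per litre"],
                            tokenizer)

args = TrainingArguments(output_dir="ChemSolubilityBERTa",
                         num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=train_dataset).train()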

How to Use

You can use the model to predict solubility for any molecule represented by a SMILES string:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("khanfs/ChemSolubilityBERTa")
model = AutoModelForSequenceClassification.from_pretrained("khanfs/ChemSolubilityBERTa")
model.eval()

smiles_string = "CCO"  # Example for ethanol
inputs = tokenizer(smiles_string, return_tensors="pt")
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
solubility = outputs.logits.item()  # single regression output: log solubility (log mol/L)
print(f"Predicted solubility: {solubility}")

Citation and Usage

If you use ChemSolubilityBERTa in your research or projects, please cite the following:

@misc{ChemSolubilityBERTa,
  author = {Farooq Khan},
  title = {ChemSolubilityBERTa: A Transformer-Based Model for Predicting Aqueous Solubility from SMILES},
  year = {2024},
  url = {https://huggingface.co/khanfs/ChemSolubilityBERTa}
}

License

This model is licensed under the MIT License.
