T5-base model pre-trained on the PseudoMD-1M dataset.

PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery, consisting of 1,020,139 pseudo molecule-description pairs. Every molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. Five examples are provided in Appendix A of the paper.
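Canonical SMILES is a normalized, single-string encoding of a molecule's structure. As a minimal illustration (not drawn from the dataset), the sketch below uses RDKit, which is not a dependency of this model, to canonicalize an arbitrarily written SMILES string; RDKit's canonical form differs in detail from PubChem's, but the idea is the same.

from rdkit import Chem

# Aspirin written with a non-canonical atom ordering
smiles = "O=C(C)Oc1ccccc1C(=O)O"

# MolToSmiles emits RDKit's canonical SMILES by default
mol = Chem.MolFromSmiles(smiles)
print(Chem.MolToSmiles(mol))  # e.g. CC(=O)Oc1ccccc1C(=O)O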

Pre-training details

Parameter        Value
Corpus Size      1,020,139
Training Steps   100,000
Learning Rate    1e-3
Batch Size       128
Warm-up Steps    1,000
Weight Decay     0.1
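
For reference, these settings map naturally onto Hugging Face Seq2SeqTrainingArguments, as in the sketch below. This mapping is an assumption for illustration; the authors' actual pre-training scripts may differ, and the data-loading and optimizer plumbing is omitted.

from transformers import Seq2SeqTrainingArguments

# Hypothetical translation of the table above into HF training arguments;
# output_dir is a placeholder path
args = Seq2SeqTrainingArguments(
    output_dir="ada-t5-base-pretrain",
    max_steps=100_000,                  # Training Steps
    learning_rate=1e-3,                 # Learning Rate
    per_device_train_batch_size=128,    # Batch Size (assuming a single device)
    warmup_steps=1_000,                 # Warm-up Steps
    weight_decay=0.1,                   # Weight Decay
)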

Example Usage

from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained Ada-T5 checkpoint and its tokenizer from the Hugging Face Hub
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)
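
Continuing the snippet above, the following example runs generation with the standard T5 generation API. The input description is hypothetical, and the exact prompt format the checkpoint expects is not documented here.

# Hypothetical input; the expected prompt format is an assumption
input_text = "The molecule is an aromatic ether."
inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))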

Citation

@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Chen, Jianyu and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}