Update README.md
---
license: apache-2.0
language:
- en
tags:
- chemistry
- biology
- medical
---
### T5-base model pre-trained on the PseudoMD-1M dataset

PseudoMD-1M is the first artificially-real dataset for cross-modal molecule discovery. It consists of 1,020,139 pseudo molecule-description pairs. Every molecule is represented by its canonical SMILES notation, sourced from PubChem via the PUG View API. On average, each description in PseudoMD-1M contains 5.11 sentences, 106.47 words, and 165.07 tokens. We provide five examples in Appendix A of the [paper](https://arxiv.org/abs/2309.05203).

### Pre-training details
| Parameter | Value |
| ---- | ---- |
| Corpus Size | 1,020,139 |
| Training Steps | 100,000 |
| Learning Rate | 1e-3 |
| Batch Size | 128 |
| Warm-up Steps | 1,000 |
| Weight Decay | 0.1 |
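
To put the table in perspective, the arithmetic below estimates how many examples the run processed and roughly how many passes over the corpus that corresponds to. It is a back-of-the-envelope sketch: it assumes the batch size counts molecule-description pairs per optimizer step with no gradient accumulation, which the table does not state.

```python
# Values copied from the pre-training table above.
corpus_size = 1_020_139      # pseudo molecule-description pairs
training_steps = 100_000
batch_size = 128             # assumption: pairs per optimizer step
warmup_steps = 1_000

# Total examples processed and the implied number of passes over the corpus.
examples_seen = training_steps * batch_size
approx_epochs = examples_seen / corpus_size

# Share of the run spent in learning-rate warm-up.
warmup_fraction = warmup_steps / training_steps

print(f"examples seen: {examples_seen:,}")
print(f"approximate passes over the corpus: {approx_epochs:.2f}")
print(f"warm-up share of training: {warmup_fraction:.1%}")
```

Under that assumption the run sees 12.8 million examples, about 12.5 passes over the corpus, with warm-up covering the first 1% of steps.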

### Example Usage

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load the pre-trained checkpoint from the Hugging Face Hub.
model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
# model_max_length sets the tokenizer's maximum input length to 512 tokens.
tokenizer = AutoTokenizer.from_pretrained("SCIR-HI/ada-t5-base", model_max_length=512)
```
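
Once the model and tokenizer are loaded, a generation call might look like the sketch below. The helper names `generate_from_text` and `demo` are hypothetical, the SMILES string `CCO` (ethanol) is an arbitrary illustrative input, and the expected prompt format is an assumption since this card does not specify one. Because this is a pre-trained checkpoint, the output may not be meaningful without fine-tuning on a downstream task.

```python
def generate_from_text(model, tokenizer, text, max_new_tokens=64):
    """Encode `text`, run generation, and decode the first returned sequence."""
    # Truncate inputs that exceed the tokenizer's configured maximum length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def demo():
    """Download the checkpoint and print the generation for one SMILES string."""
    # Imported here so the helper above can be reused without loading transformers.
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("SCIR-HI/ada-t5-base")
    tokenizer = AutoTokenizer.from_pretrained(
        "SCIR-HI/ada-t5-base", model_max_length=512
    )
    # "CCO" is ethanol; the raw-SMILES prompt format is an assumption.
    print(generate_from_text(model, tokenizer, "CCO"))
```

Calling `demo()` downloads the weights on first use. The helper takes the model and tokenizer as arguments so the same objects can be reused across many inputs.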

### [Citation](https://arxiv.org/abs/2309.05203)

```bibtex
@article{chen2023artificially,
  title={From Artificially Real to Real: Leveraging Pseudo Data from Large Language Models for Low-Resource Molecule Discovery},
  author={Chen, Yuhan and Xi, Nuwa and Du, Yanrui and Wang, Haochun and Jianyu, Chen and Zhao, Sendong and Qin, Bing},
  journal={arXiv preprint arXiv:2309.05203},
  year={2023}
}
```