# AIDO.RNA 1.6B

AIDO.RNA is a 1.6B parameter RNA foundation model trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance across a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/mNqn5SKQFHxSby3E2dosE.png" alt="description" style="width:80%; height:auto;">
</p>

## Model architectural details
AIDO.RNA is an encoder-only transformer pre-trained with the masked language modeling (MLM) objective. The model architecture parameters are as follows:
| Hyperparameter | Value |
| :---: | :----: |
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
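
As a quick sanity check, these hyperparameters are consistent with the 1.6B parameter count if one assumes a gated (SwiGLU-style) FFN, which the unusual ffn-hidden-size of 5,440 (close to 2/3 of 4 × 2,048) suggests. The back-of-the-envelope estimate below ignores embeddings, biases, and layer norms; it is our arithmetic, not an official figure.
```
# Rough parameter count from the table above
# (assumes a gated FFN; embeddings, biases, and norms omitted as negligible)
num_layers, hidden, ffn_hidden = 32, 2048, 5440

attn_params = 4 * hidden * hidden        # Q, K, V, and output projections
ffn_params = 3 * hidden * ffn_hidden     # up, gate, and down projections
total = num_layers * (attn_params + ffn_params)

print(f"{total / 1e9:.2f}B parameters")  # ~1.61B, consistent with the 1.6B model size
```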


## Pre-training data
The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0. 
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/EKvuUI9mBw5hkErzpXKm9.png" alt="description" style="width:100%; height:auto;">
</p>

## Downstream evaluation
<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/uvII1Q_1vDe95WCP1RgUV.png" alt="description" style="width:90%; height:auto;">
</p>


## How to Use
Build downstream models from this backbone using the task wrappers shown below.

### Get RNA sequence embedding
```
from genbio_finetune.tasks import Embed

# Load the pre-trained backbone in inference mode
model = Embed.from_config({"model.backbone": "rnafm"}).eval()

# Tokenize and batch the input sequences
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})

# Forward pass returns per-token embeddings
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)
```
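
If you need one fixed-size vector per sequence (e.g., as features for a separate classifier), a common recipe is to mean-pool over the token dimension. The snippet below continues from the example above and assumes `embedding` is a `(batch, tokens, hidden)` tensor; that shape is our assumption, not documented behavior.
```
# Mean-pool token embeddings into one vector per sequence
# (assumes embedding has shape (batch, tokens, hidden); with padded
# batches, mask out padding tokens before averaging)
sequence_embedding = embedding.mean(dim=1)
print(sequence_embedding.shape)  # (batch, hidden), e.g. (2, 2048)
```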

### Sequence-level classification
```
import torch
from genbio_finetune.tasks import SequenceClassification

# Binary sequence classifier on top of the backbone
model = SequenceClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})

# One logit vector per sequence
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class per sequence
```

### Token-level classification
```
import torch
from genbio_finetune.tasks import TokenClassification

# Three-way classifier applied to every token
model = TokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})

# One logit vector per token
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))  # predicted class per token
```


### Pairwise token-level classification
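Pairwise token-level classification (e.g., predicting base-pairing between positions for secondary structure) is not yet documented. As a placeholder, here is a hypothetical sketch that assumes a `PairwiseTokenClassification` task mirroring the interface of the other tasks above; the class name and output shape are assumptions, not the confirmed API.
```
import torch
# Hypothetical import: this task is not yet documented in genbio_finetune
from genbio_finetune.tasks import PairwiseTokenClassification

# Assumed interface, mirroring the other tasks above
model = PairwiseTokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})

# Assumed output: one logit vector per token pair, shape (batch, tokens, tokens, n_classes)
logits = model(collated_batch)
print(torch.argmax(logits, dim=-1))
```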


### Sequence-level regression
```
from genbio_finetune.tasks import SequenceRegression

# Regression head producing one continuous value per sequence
model = SequenceRegression.from_config({"model.backbone": "rnafm"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})

# One predicted value per sequence
predictions = model(collated_batch)
print(predictions)
```

Or use our one-liner CLI to finetune or evaluate any of the above!
```
gbft fit --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```

For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

## Citation
Please cite AIDO.RNA using the following BibTeX code:


## License
@Hongyi TODO