# AIDO.RNA 1.6B
AIDO.RNA is a 1.6B-parameter RNA foundation model trained on 42 million non-coding RNA sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a comprehensive set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/mNqn5SKQFHxSby3E2dosE.png" alt="description" style="width:80%; height:auto;">
</p>
## Model architectural details
AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. Its architecture hyperparameters are as follows:
| hyperparameter | value |
| :---: | :----: |
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
| vocab-size | 16 |
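
As a rough sanity check, the hyperparameters in the table can be tied back to the 1.6B figure. The sketch below is a back-of-the-envelope estimate, not the official accounting. It assumes a gated feed-forward block with three projection matrices (an assumption; the table does not state the FFN type), ignores biases and normalization weights, and uses a hypothetical helper named `estimate_params`.

```python
def estimate_params(num_layers, hidden, ffn_hidden, vocab):
    """Rough encoder parameter count (weight matrices only, no biases or norms)."""
    attention = 4 * hidden * hidden   # Q, K, V and output projections
    ffn = 3 * hidden * ffn_hidden     # ASSUMPTION: gated FFN with three matrices
    embeddings = vocab * hidden       # token embedding table
    return num_layers * (attention + ffn) + embeddings


total = estimate_params(num_layers=32, hidden=2048, ffn_hidden=5440, vocab=16)
print(f"{total / 1e9:.2f}B parameters")  # 1.61B parameters
```

Under these assumptions the estimate lands at about 1.61B, in line with the stated model size. A plain two-matrix FFN would give roughly 1.25B instead, which is why the gated variant is assumed here.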
## Pre-training data
The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/EKvuUI9mBw5hkErzpXKm9.png" alt="description" style="width:90%; height:auto;">
</p>
## Downstream evaluation
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/uvII1Q_1vDe95WCP1RgUV.png" alt="description" style="width:90%; height:auto;">
</p>
## How to Use
Build any downstream model on top of this backbone.
### Get RNA sequence embedding
```python
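from genbio_finetune.tasks import Embed

# Build the embedding task on the "rnafm" backbone and switch to inference mode.
model = Embed.from_config({"model.backbone": "rnafm"}).eval()
# Turn raw sequences into a model-ready batch.
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)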
```
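
Downstream code often needs one vector per sequence. The exact shape returned by `Embed` is not stated here, so the sketch below assumes per-token embeddings of shape (batch, seq_len, hidden) and uses plain Python lists with made-up numbers so that it runs without the model. The `mean_pool` helper and the `mask` argument are hypothetical and not part of `genbio_finetune`.

```python
def mean_pool(token_embeddings, mask):
    """Average token vectors per sequence, skipping positions where mask is 0."""
    pooled = []
    for vectors, flags in zip(token_embeddings, mask):
        kept = [v for v, keep in zip(vectors, flags) if keep]
        hidden = len(vectors[0])
        pooled.append([sum(v[i] for v in kept) / len(kept) for i in range(hidden)])
    return pooled


# Toy batch: 2 sequences, 3 positions, hidden size 2 (illustrative values only).
toy = [
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],  # last position is padding
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],
]
mask = [[1, 1, 0], [1, 1, 1]]
print(mean_pool(toy, mask))  # [[2.0, 3.0], [4.0, 4.0]]
```

Masking matters when sequences in a batch have different lengths: without it, padding positions would pull the average toward zero.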
### Sequence-level classification
```python
import torch
from genbio_finetune.tasks import SequenceClassification

# Two-class classification head on top of the "rnafm" backbone.
model = SequenceClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each sequence.
print(torch.argmax(logits, dim=-1))
```
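
Assuming the task head returns unnormalized scores, as the variable name `logits` suggests, a softmax turns each row into class probabilities. The sketch below is a minimal illustration in plain Python with made-up values. The `softmax` helper is not part of `genbio_finetune`; with tensors, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [2.0, 0.0]  # illustrative scores for a two-class problem
probs = softmax(example_logits)
print(probs)
```

The class with the largest logit also has the largest probability, so `argmax` over logits and over probabilities gives the same prediction.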
### Token-level classification
```python
import torch
from genbio_finetune.tasks import TokenClassification

# Three-class classification head applied to every token, on the "rnafm" backbone.
model = TokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
# Predicted class index for each token.
print(torch.argmax(logits, dim=-1))
```
### Pairwise token-level classification
@Sazan TODO
### Sequence-level regression
```python
from genbio_finetune.tasks import SequenceRegression

# Regression head on top of the "rnafm" backbone.
model = SequenceRegression.from_config({"model.backbone": "rnafm"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
# Regression outputs are continuous values, not logits.
predictions = model(collated_batch)
print(predictions)
```
### RNA inverse folding
@Sazan TODO
Alternatively, use the one-line CLI to fine-tune or evaluate any of the tasks above:
```
gbft fit --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
```
For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)
## Citation
Please cite AIDO.RNA using the following BibTeX code:
## License
@Hongyi TODO