
AIDO.RNA 1.6B

AIDO.RNA is a 1.6B-parameter RNA foundation model trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution. It achieves state-of-the-art performance on a broad set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding.


Model architectural details

AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. Its architecture hyperparameters are as follows:

| hyperparameter | value |
|---|---|
| num-layers | 32 |
| hidden-size | 2,048 |
| ffn-hidden-size | 5,440 |
| num-attn-heads | 32 |
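
As a rough sanity check on the table above, the hyperparameters can be turned into a parameter estimate. The sketch below assumes a gated feed-forward block with three weight matrices per layer (an assumption, not stated in this README) and ignores embeddings, biases, and normalization weights. The function name is illustrative only.

```python
def estimate_encoder_params(num_layers, hidden_size, ffn_hidden_size, ffn_matrices=3):
    """Estimate the weight count of a transformer encoder stack.

    Counts only the large matrices: four hidden x hidden attention
    projections (query, key, value, output) and `ffn_matrices`
    hidden x ffn feed-forward projections per layer.
    ffn_matrices=3 is an ASSUMPTION (gated feed-forward block).
    """
    attention = 4 * hidden_size * hidden_size
    feed_forward = ffn_matrices * hidden_size * ffn_hidden_size
    return num_layers * (attention + feed_forward)


# Values taken from the hyperparameter table.
total = estimate_encoder_params(num_layers=32, hidden_size=2048, ffn_hidden_size=5440)
head_dim = 2048 // 32  # hidden-size divided by num-attn-heads

print(f"estimated parameters: {total / 1e9:.2f}B")
print(f"per-head dimension: {head_dim}")
```

Under that assumption the estimate comes to about 1.61B, in line with the model name. With a two-matrix feed-forward block (`ffn_matrices=2`) the same formula gives about 1.25B.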

Pre-training data

The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
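
To make the MLM pre-training objective concrete, here is a minimal sketch of how one sequence could be masked at single-nucleotide resolution. The 15% masking rate, the `<mask>` token string, the example sequence, and the helper name are all assumptions for illustration; the actual masking recipe is not described in this README.

```python
import random

MASK = "<mask>"  # hypothetical mask token string


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Replace a random subset of nucleotides with MASK.

    Returns the masked token list and a dict mapping each masked
    position to the original nucleotide the model would be trained
    to recover. mask_rate=0.15 is an ASSUMED value.
    """
    rng = random.Random(seed)
    tokens = list(seq)  # one token per nucleotide
    n_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    labels = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = MASK
    return tokens, labels


sequence = "AUGGCUACGUAGCUAGCUAA"  # made-up example sequence
tokens, labels = mask_sequence(sequence)

# Filling the masked positions back in recovers the original sequence.
restored = "".join(labels.get(i, tok) for i, tok in enumerate(tokens))

print(tokens)
print(labels)
```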


Downstream evaluation

[Figure: downstream evaluation results]

How to Use

Build any downstream models from this backbone

Get RNA sequence embedding

from genbio_finetune.tasks import Embed
model = Embed.from_config({"model.backbone": "rnafm"}).eval()  # evaluation mode
collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})  # build a model-ready batch from raw sequences
embedding = model(collated_batch)
print(embedding.shape)
print(embedding)

Sequence-level classification

import torch
from genbio_finetune.tasks import SequenceClassification
model = SequenceClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
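
The logits printed above are unnormalized scores. If class probabilities are needed, a softmax over the last dimension converts them; with PyTorch this would be `torch.softmax(logits, dim=-1)`. The dependency-free sketch below shows the same computation, assuming logits of shape (batch, n_classes). The numeric values are placeholders, not model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Placeholder logits for two sequences and two classes (NOT real model output).
example_logits = [[0.2, 1.3], [2.0, -1.0]]

probs = [softmax(row) for row in example_logits]
# Index of the highest-probability class per sequence, as argmax would return.
predicted = [max(range(len(row)), key=row.__getitem__) for row in probs]

print(probs)
print(predicted)
```

Softmax preserves the ordering of the scores, so taking the argmax of the probabilities selects the same class as taking the argmax of the logits.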

Token-level classification

import torch
from genbio_finetune.tasks import TokenClassification
model = TokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 3}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
logits = model(collated_batch)
print(logits)
print(torch.argmax(logits, dim=-1))
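
For token-level output, the argmax yields one class index per position. The sketch below shows how such per-position scores could be lined up with the nucleotides of one sequence. It assumes exactly one row of scores per nucleotide, with any special tokens already removed, which this README does not confirm. The class names and numeric values are placeholders.

```python
def label_tokens(sequence, token_logits, class_names):
    """Pair each nucleotide with its highest-scoring class name.

    ASSUMES len(token_logits) == len(sequence), i.e. one row of
    class scores per nucleotide and no special-token positions.
    """
    if len(sequence) != len(token_logits):
        raise ValueError("expected one row of logits per nucleotide")
    labeled = []
    for nucleotide, scores in zip(sequence, token_logits):
        best = max(range(len(scores)), key=scores.__getitem__)
        labeled.append((nucleotide, class_names[best]))
    return labeled


class_names = ["class_0", "class_1", "class_2"]  # hypothetical label names
# Placeholder scores for a 4-nucleotide sequence and 3 classes (NOT real model output).
example_logits = [
    [0.1, 0.7, 0.2],
    [0.9, 0.0, 0.1],
    [0.2, 0.2, 0.6],
    [0.3, 0.4, 0.3],
]

labeled = label_tokens("ACGT", example_logits, class_names)
print(labeled)
```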

Pairwise token-level classification

@Sazan TODO

Sequence-level regression

from genbio_finetune.tasks import SequenceRegression
model = SequenceRegression.from_config({"model.backbone": "rnafm"}).eval()
collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
predictions = model(collated_batch)  # regression outputs, one per sequence
print(predictions)

Or use our one-liner CLI to finetune or evaluate any of the above!

gbft fit --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
gbft test --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>

For more information, visit: Model Generator

Citation

Please cite AIDO.RNA using the following BibTeX code:

License

@Hongyi TODO