	# AIDO.RNA 1.6B

	AIDO.RNA is a 1.6B-parameter RNA foundation model trained on 42 million non-coding RNA (ncRNA) sequences at single-nucleotide resolution. It achieves state-of-the-art performance across a broad set of tasks, including RNA secondary structure prediction, mRNA-related tasks, RNA function prediction, and RNA inverse folding.
	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/mNqn5SKQFHxSby3E2dosE.png" alt="description" style="width:80%; height:auto;">
	</p>

	## Model architectural details
	AIDO.RNA is an encoder-only transformer pre-trained with a masked language modeling (MLM) objective. The model architecture parameters are as follows:
	| hyperparameter | value |
	| :---: | :----: |
	| num-layers | 32 |
	| hidden-size | 2,048 |
	| ffn-hidden-size | 5,440 |
	| num-attn-heads | 32 |
	| vocab-size | 16 |
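
As a rough sanity check, the hyperparameters above can be turned into an approximate parameter count. The sketch below assumes a gated (SwiGLU-style) feed-forward block with three weight matrices per layer and ignores biases, normalization weights, and any output head. This card does not state those details, so treat the result as an estimate rather than the exact figure.

```python
# Approximate parameter count from the hyperparameter table.
# Assumption (not stated in this card): gated feed-forward block with 3 weight matrices.
NUM_LAYERS = 32
HIDDEN = 2048
FFN_HIDDEN = 5440
NUM_HEADS = 32
VOCAB = 16

head_dim = HIDDEN // NUM_HEADS
attn_params = 4 * HIDDEN * HIDDEN      # Q, K, V and output projections
ffn_params = 3 * HIDDEN * FFN_HIDDEN   # gate, up and down projections (assumed)
embed_params = VOCAB * HIDDEN          # token embedding table
total_params = NUM_LAYERS * (attn_params + ffn_params) + embed_params

print(f"head dimension: {head_dim}")
print(f"approximate parameters: {total_params / 1e9:.2f}B")
```

Under these assumptions the estimate comes to about 1.61B parameters with a head dimension of 64, which is consistent with the model name.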


	## Pre-training data
	The pre-training data contains 42 million unique ncRNA sequences from RNAcentral version 24.0.
	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/EKvuUI9mBw5hkErzpXKm9.png" alt="description" style="width:90%; height:auto;">
	</p>
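
To make the masked language modeling objective used during pre-training more concrete, here is a minimal sketch of BERT-style token corruption over a nucleotide sequence. The 15% masking rate, the 80/10/10 replacement split, the `[MASK]` token name, and the four-letter alphabet are the common BERT convention and are assumptions here, not values taken from this card. `mask_tokens` is a hypothetical helper, not part of any library.

```python
import random

MASK = "[MASK]"                     # assumed mask token name
NUCLEOTIDES = ["A", "C", "G", "U"]  # assumed alphabet for random replacement

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels); labels are None where no loss is computed."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must recover the original token here
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)                     # replace with mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(NUCLEOTIDES))  # replace with random token
            else:
                corrupted.append(tok)                      # keep the original token
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels

tokens = list("GGGAAACUUCGGUUUCCC")
corrupted, labels = mask_tokens(tokens, mask_prob=0.15, seed=0)
print("".join("_" if t == MASK else t for t in corrupted))
```

The training loss would then be computed only at positions where `labels` is not `None`.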

	## Downstream evaluation
	<p align="center">
	<img src="https://cdn-uploads.huggingface.co/production/uploads/63008d4bc1e149ceaff724a3/uvII1Q_1vDe95WCP1RgUV.png" alt="description" style="width:90%; height:auto;">
	</p>


	## How to Use
	You can build any downstream model on top of this backbone. The examples below cover common task types.

	### Get RNA sequence embedding
	```python
	from genbio_finetune.tasks import Embed

	# Load the backbone in inference mode
	model = Embed.from_config({"model.backbone": "rnafm"}).eval()
	# Convert raw sequences into a model-ready batch
	collated_batch = model.collate({"sequences": ["ACGT", "ACGT"]})
	embedding = model(collated_batch)
	print(embedding.shape)
	print(embedding)
	```
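
If a single vector per sequence is needed, one common option is to average the per-position embeddings while skipping padding. The sketch below uses plain Python lists to stay dependency-free and assumes embeddings shaped (batch, length, hidden) together with a 0/1 padding mask. Neither the output layout of `Embed` nor the helper `mean_pool` is confirmed by this card; both are illustrative assumptions.

```python
def mean_pool(embeddings, mask):
    """Average token vectors per sequence, ignoring positions where mask == 0."""
    pooled = []
    for seq_vectors, seq_mask in zip(embeddings, mask):
        kept = [vec for vec, keep in zip(seq_vectors, seq_mask) if keep]
        if not kept:
            raise ValueError("sequence has no unmasked positions")
        hidden = len(kept[0])
        pooled.append([sum(vec[d] for vec in kept) / len(kept) for d in range(hidden)])
    return pooled

# Toy batch: 2 sequences, 3 positions, hidden size 2.
# The last position of the second sequence is padding.
toy_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],
]
toy_mask = [[1, 1, 1], [1, 1, 0]]
pooled = mean_pool(toy_embeddings, toy_mask)
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```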

	### Sequence-level classification
	```python
	import torch
	from genbio_finetune.tasks import SequenceClassification

	# Two-class classification head on top of the backbone
	model = SequenceClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 2}).eval()
	collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
	logits = model(collated_batch)
	print(logits)
	# Predicted class index for each sequence
	print(torch.argmax(logits, dim=-1))
	```
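
The `argmax` above returns only the predicted class. To report class probabilities, a softmax over the logits is the usual step. This sketch assumes logits shaped (batch, n_classes) and uses made-up values in plain Python; with PyTorch tensors, `torch.softmax(logits, dim=-1)` computes the same thing.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    shift = max(row)
    exps = [math.exp(x - shift) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [[2.0, 0.0], [-1.0, 1.0]]  # illustrative values, not model output
probs = [softmax(row) for row in toy_logits]
preds = [max(range(len(p)), key=p.__getitem__) for p in probs]
print(probs)
print(preds)  # [0, 1]
```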

	### Token-level classification
	```python
	import torch
	from genbio_finetune.tasks import TokenClassification

	# Three-class classification head applied at every token position
	model = TokenClassification.from_config({"model.backbone": "rnafm", "model.n_classes": 3}).eval()
	collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
	logits = model(collated_batch)
	print(logits)
	# Predicted class index for each token
	print(torch.argmax(logits, dim=-1))
	```


	### Pairwise token-level classification
	TODO: example to be added.


	### Sequence-level regression
	```python
	from genbio_finetune.tasks import SequenceRegression

	model = SequenceRegression.from_config({"model.backbone": "rnafm"}).eval()
	collated_batch = model.collate({"sequences": ["ACGT", "AGCT"]})
	# The regression head outputs continuous values rather than class logits
	predictions = model(collated_batch)
	print(predictions)
	```

	### RNA inverse folding
	TODO: example to be added.

	Alternatively, use the one-line CLI to fine-tune or evaluate any of the tasks above:
	```bash
	gbft fit --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
	gbft test --model SequenceClassification --model.backbone rnafm --data SequenceClassification --data.path <hf_or_local_path_to_your_dataset>
	```

	For more information, visit: [Model Generator](https://github.com/genbio-ai/modelgenerator)

	## Citation
	Please cite AIDO.RNA using the following BibTeX code:


	## License
	TODO: license to be added.