---
license: mit
tags:
- biology
- medical
---
# Model Overview
Orthrus is a foundation model for mature RNA, built for RNA property prediction. It is pre-trained with contrastive learning on 45M+ mature RNA transcripts to capture functional and evolutionary relationships across all mammalian organisms. Orthrus is built on a Mamba encoder backbone, which allows it to embed arbitrarily long RNA sequences. We offer two sizes of Orthrus: `base` has ~1M parameters, and `large` has ~10M parameters.
Three versions of Orthrus are available for use via HuggingFace ([See collection](https://huggingface.co/collections/quietflamingo/orthrus-67902c204932687e24cc1c01)):
- [**Orthrus base 4-track**](https://huggingface.co/quietflamingo/orthrus-base-4-track): Encodes the mRNA sequence with a simplified one-hot approach.
- [**Orthrus large 4-track**](https://huggingface.co/quietflamingo/orthrus-large-4-track): Larger version of the above.
- [**Orthrus large 6-track**](https://huggingface.co/quietflamingo/orthrus-large-6-track): Adds biological context by including splice site indicators and coding sequence markers, which is crucial for accurately predicting mRNA properties such as RNA half-life, ribosome load, and exon junctions.
**This HF repo contains the `orthrus-base-4-track` model.**
Additional project files and the GitHub repository can be found at:
- https://huggingface.co/antichronology/orthrus
- https://github.com/bowang-lab/Orthrus
# Using Orthrus (4-track)
To generate embeddings using Orthrus for spliced mature RNA sequences, follow the steps below:
> [!TIP]
> **NOTE:**
> Orthrus was trained and built to model full mature RNA sequences, so incomplete pieces of spliced RNA will be out of distribution as input. This differs from existing DNA / RNA foundation models, which model arbitrary genomic segments.
## Create and Set Up the Environment
This environment setup was tested with PyTorch 2.2.2 and CUDA 12.1.
1. Set up a conda environment
```
conda create --name orthrus
conda activate orthrus
```
2. Install required dependencies
```
pip install torch==2.2.2 --index-url https://download.pytorch.org/whl/cu121
pip install causal_conv1d==1.2.0.post2
pip install mamba-ssm==1.2.0.post1
pip install transformers
```
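After installing, it can help to confirm that the packages are visible to the active Python interpreter. The following is a minimal sketch, not part of the Orthrus codebase: `check_environment` is a hypothetical helper, and the import names (`causal_conv1d`, `mamba_ssm`) are assumed to match the packages installed above.

```python
import importlib.util
from importlib import metadata

# Maps import name -> pip distribution name (assumed from the install commands above).
REQUIRED = {
    "torch": "torch",
    "causal_conv1d": "causal_conv1d",
    "mamba_ssm": "mamba-ssm",
    "transformers": "transformers",
}


def check_environment(required=REQUIRED):
    """Return {import_name: version, "unknown", or None if not importable}."""
    report = {}
    for module_name, dist_name in required.items():
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module_name) is None:
            report[module_name] = None
            continue
        try:
            report[module_name] = metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            # Importable, but no distribution metadata under this name.
            report[module_name] = "unknown"
    return report


for name, version in check_environment().items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

Comparing the reported versions against the pinned ones above is a quick way to spot a mismatched install before loading the model.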
## Load Orthrus from HuggingFace
```python
import torch
from transformers import AutoModel
device = torch.device("cuda")
orthrus_4 = AutoModel.from_pretrained(
    "quietflamingo/orthrus-base-4-track",
    trust_remote_code=True
).to(device)
```
## Get Sequence Embeddings
```python
sequence = "ATGATGATG"
seq_ohe = orthrus_4.seq_to_oh(sequence).to(device)
model_input_tt = seq_ohe.unsqueeze(0)
lengths = torch.Tensor([model_input_tt.shape[1]]).to(device)
embedding = orthrus_4.representation(
    model_input_tt,  # (1 x L x 4)
    lengths,         # (1,)
    channel_last=True
)
print(embedding.shape) # (1 x 256)
```
An example of sequence embedding using Orthrus is shown in this [Colab notebook](https://colab.research.google.com/drive/1Rb6VC92YoKRPyF2LG4m8zIXjDszm1NZW?usp=sharing).
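For intuition about the `(1 x L x 4)` input used above, the sketch below shows what a 4-track one-hot encoding of a nucleotide sequence looks like. This is an illustration only and not Orthrus's own `seq_to_oh`: the helper name `one_hot_4_track`, the channel order `A, C, G, T`, and the all-zero row for unknown bases are assumptions made for this example.

```python
# Hypothetical illustration; the real encoding is done by `orthrus_4.seq_to_oh`.
# Assumed channel order: A, C, G, T.
CHANNELS = "ACGT"


def one_hot_4_track(sequence: str) -> list[list[int]]:
    """Return an L x 4 one-hot matrix (nested lists) for a nucleotide string."""
    index = {base: i for i, base in enumerate(CHANNELS)}
    rows = []
    for base in sequence.upper():
        row = [0] * len(CHANNELS)
        # Bases outside A/C/G/T (e.g. N) stay all-zero by assumption.
        if base in index:
            row[index[base]] = 1
        rows.append(row)
    return rows


encoded = one_hot_4_track("ATGATGATG")
print(len(encoded), len(encoded[0]))  # 9 4
```

Each row corresponds to one nucleotide and each column to one track, which is why a single sequence of length `L` becomes an `L x 4` array before the batch dimension is added.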
# Citation
```
@article{orthrus_fradkin_shi_2024,
  title = {Orthrus: Towards Evolutionary and Functional RNA Foundation Models},
  url = {http://dx.doi.org/10.1101/2024.10.10.617658},
  DOI = {10.1101/2024.10.10.617658},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Fradkin, Philip and Shi, Ruian and Isaev, Keren and Frey, Brendan J and Morris, Quaid and Lee, Leo J and Wang, Bo},
  year = {2024},
  month = oct
}
```