InstaDeepAI/isoformer


pbordesinstadeep committed on Jun 10, 2024

Commit f9ff480 · verified · 1 Parent(s): fc508dc

Update README.md

Files changed (1)

README.md +31 -3


@@ -1,3 +1,26 @@
 A small snippet of code is given here in order to retrieve embeddings and gene expression predictions given a DNA, RNA and protein sequence.
 ```python
@@ -6,8 +29,8 @@ import numpy as np
 import torch
 # Import the tokenizer and the model
-tokenizer = AutoTokenizer.from_pretrained("isoformer-anonymous/Isoformer", trust_remote_code=True)
-model = AutoModelForMaskedLM.from_pretrained("isoformer-anonymous/Isoformer",trust_remote_code=True)
 protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
 rna_sequences = ["ATTCCGGTTTTCA" * 9]
@@ -33,4 +56,9 @@ torch_output = model.forward(
 print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
 print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")
-```

+---
+license: cc-by-nc-sa-4.0
+tags:
+- DNA
+- RNA
+- protein
+- biology
+- genomics
+datasets:
+- InstaDeepAI/multi_omics_transcript_expression
+---
+# Isoformer
+Isoformer is a multi-modal model that predicts differential transcript expression from DNA, RNA and protein sequences, outperforming existing methods.
+Our framework achieves efficient knowledge transfer from three pre-trained encoders: Enformer for the DNA modality, Nucleotide Transformer v2 for the RNA modality and ESM2 for the protein modality.
+**Developed by:** InstaDeep
+### How to use
 A small snippet of code is given here in order to retrieve embeddings and gene expression predictions given a DNA, RNA and protein sequence.
 ```python
 import torch
 # Import the tokenizer and the model
+tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
+model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
 protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
 rna_sequences = ["ATTCCGGTTTTCA" * 9]
 print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
 print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")
+```
+## Training data
+Isoformer is trained on RNA transcript expression data obtained from the GTEx portal, namely transcript TPM measurements across 30 tissues from more than 5,000 individuals. In total, the dataset comprises ∼170k unique transcripts, of which 90k are protein-coding and correspond to ∼20k unique genes.
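
Note on the training targets: the "Training data" section added in this commit refers to transcript TPM (transcripts per million) measurements. As a reminder of what that unit means, here is a minimal sketch of the standard TPM definition for a single sample. The function name `counts_to_tpm` and the toy counts and lengths are made up for illustration. The README does not say how the GTEx values were computed, so this is not a description of that pipeline.

```python
import numpy as np


def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM for one sample (textbook definition).

    Hypothetical helper for illustration only; not part of Isoformer or GTEx.
    """
    counts = np.asarray(counts, dtype=float)
    lengths_kb = np.asarray(lengths_bp, dtype=float) / 1_000.0
    # Length-normalised rate: reads per kilobase of transcript.
    rate = counts / lengths_kb
    total = rate.sum()
    if total == 0:
        # No reads at all: every transcript gets a TPM of zero.
        return np.zeros_like(rate)
    # Rescale so that the values of one sample sum to one million.
    return rate / total * 1e6


# Toy, made-up numbers: four transcripts in one sample.
toy_counts = [100, 200, 0, 50]
toy_lengths_bp = [1_000, 2_000, 1_500, 500]
tpm = counts_to_tpm(toy_counts, toy_lengths_bp)
print(tpm)
```

Because TPM values are normalised within a sample, they always sum to one million, which makes expression levels comparable across tissues and individuals regardless of sequencing depth.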