pbordesinstadeep committed · Commit f9ff480 · verified · 1 Parent(s): fc508dc

Update README.md

  1. README.md +31 -3
README.md CHANGED
@@ -1,3 +1,26 @@
  A small snippet of code is given here in order to retrieve embeddings and gene expression predictions given a DNA, RNA and protein sequence.

  ```python
@@ -6,8 +29,8 @@ import numpy as np
  import torch

  # Import the tokenizer and the model
- tokenizer = AutoTokenizer.from_pretrained("isoformer-anonymous/Isoformer", trust_remote_code=True)
- model = AutoModelForMaskedLM.from_pretrained("isoformer-anonymous/Isoformer",trust_remote_code=True)

  protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
  rna_sequences = ["ATTCCGGTTTTCA" * 9]
@@ -33,4 +56,9 @@ torch_output = model.forward(
  print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
  print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")

- ```
 
+ ---
+ license: cc-by-nc-sa-4.0
+ tags:
+ - DNA
+ - RNA
+ - protein
+ - biology
+ - genomics
+ datasets:
+ - InstaDeepAI/multi_omics_transcript_expression
+ ---
+
+ # Isoformer
+
+ Isoformer is a multi-modal model that accurately predicts differential transcript expression, outperforming existing methods.
+ The framework transfers knowledge efficiently from three pre-trained encoders: Enformer for the DNA modality, Nucleotide Transformer v2 for the RNA modality and ESM2 for the protein modality.
+
+ **Developed by:** InstaDeep
+
+
+ ## How to use
+
+
  A small snippet of code is given here in order to retrieve embeddings and gene expression predictions given a DNA, RNA and protein sequence.

  ```python

  import torch

  # Import the tokenizer and the model
+ tokenizer = AutoTokenizer.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)
+ model = AutoModelForMaskedLM.from_pretrained("InstaDeepAI/isoformer", trust_remote_code=True)

  protein_sequences = ["RSRSRSRSRSRSRSRSRSRSRL" * 9]
  rna_sequences = ["ATTCCGGTTTTCA" * 9]

  print(f"Gene expression predictions: {torch_output['gene_expression_predictions']}")
  print(f"Final DNA embedding: {torch_output['final_dna_embeddings']}")

+ ```
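
The diff elides most of the snippet, so the shape of the inputs is easy to miss: each modality is passed as a list holding one string, built by repeating a short motif nine times. The sketch below rebuilds the two inputs visible in the diff and sanity-checks them without loading the model. `check_alphabet`, `NUCLEOTIDES` and `AMINO_ACIDS` are hypothetical names introduced for illustration and are not part of the Isoformer API; the allowed alphabets are assumptions, not something the README states.

```python
# Inputs copied from the README snippet: one sample per modality,
# each built by repeating a short motif nine times.
protein_motif = "RSRSRSRSRSRSRSRSRSRSRL"
rna_motif = "ATTCCGGTTTTCA"

protein_sequences = [protein_motif * 9]
rna_sequences = [rna_motif * 9]

# Assumed alphabets (not taken from the README). Note that the README
# writes its RNA example with T rather than U.
NUCLEOTIDES = "ACGTN"
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def check_alphabet(sequences, allowed):
    """Hypothetical helper: True if every sequence uses only `allowed` characters."""
    allowed_set = set(allowed)
    return all(set(seq) <= allowed_set for seq in sequences)


# Each list is a batch of size one; the string length is 9x the motif length.
print(len(protein_sequences), len(protein_sequences[0]))
print(len(rna_sequences), len(rna_sequences[0]))
print(check_alphabet(rna_sequences, NUCLEOTIDES))
print(check_alphabet(protein_sequences, AMINO_ACIDS))
```

A check of this kind can catch a swapped modality (for example, a protein string passed where RNA is expected) before the tokenizer is called.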
+
+
+ ## Training data
+
+ Isoformer is trained on RNA transcript expression data obtained from the GTEx portal, namely transcript TPM measurements across 30 tissues from more than 5,000 individuals. In total, the dataset contains ∼170k unique transcripts, of which 90k are protein-coding and correspond to ∼20k unique genes.
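
For readers unfamiliar with the unit, TPM (transcripts per million) normalizes read counts by transcript length and then rescales so that the values in a sample sum to one million. The sketch below shows that standard definition on made-up numbers. It is not the GTEx processing pipeline, which the README does not describe, and `counts_to_tpm` is a hypothetical helper name.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM using the standard definition.

    counts:     reads assigned to each transcript (illustrative values)
    lengths_bp: transcript lengths in base pairs
    """
    # Reads per kilobase: corrects for longer transcripts collecting more reads.
    rates = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]
    total = sum(rates)
    # Rescale so the values of one sample sum to one million.
    return [r / total * 1e6 for r in rates]


# Made-up example with three transcripts in a single sample.
counts = [100, 200, 300]
lengths_bp = [1000, 2000, 1500]
tpm = counts_to_tpm(counts, lengths_bp)
print(tpm)
print(sum(tpm))
```

Because every sample sums to the same total, TPM values describe the relative abundance of each transcript within a sample.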