Update README.md
geneformer model card
README.md
CHANGED
@@ -8,87 +8,35 @@ tags:
# Geneformer
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

See [geneformer.readthedocs.io](https://geneformer.readthedocs.io) for documentation.

The current default model in the main directory of the repository is GF-12L-95M-i4096.

The repository also contains fine-tuned models in the fine_tuned_models directory, as well as the cancer-tuned model GF-12L-95M-i4096_CLcancer, obtained by continual learning on ~14 million cancer cells.

# Application
The pretrained Geneformer model can be used directly for zero-shot learning, for example in silico perturbation analysis, or by fine-tuning towards the relevant downstream task, such as gene or cell state classification (a minimal fine-tuning sketch follows the example lists below).

Example applications demonstrated in [our manuscript](https://rdcu.be/ddrx0) include:

*Fine-tuning*:
- transcription factor dosage sensitivity
- chromatin dynamics (bivalently marked promoters)
- transcription factor regulatory range
- gene network centrality
- transcription factor targets
- cell type annotation
- batch integration
- cell state classification across differentiation
- disease classification
- in silico perturbation to determine disease-driving genes
- in silico treatment to determine candidate therapeutic targets

*Zero-shot learning*:
- batch integration
- gene context specificity
- in silico reprogramming
- in silico differentiation
- in silico perturbation to determine impact on cell state
- in silico perturbation to determine transcription factor targets
- in silico perturbation to determine transcription factor cooperativity

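As an illustration of the fine-tuning route, a classification task such as cell type annotation or disease classification can start from the pretrained checkpoint with a classification head. The snippet below is only a minimal sketch using the Hugging Face `transformers` API; the label count, hyperparameter values, and the tokenized datasets are task-specific placeholders rather than values from the manuscript, and the repository's own fine-tuning examples (see Installation below) remain the authoritative reference.

```python
from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Start from the pretrained Geneformer checkpoint with a fresh classification head;
# num_labels is task-specific (placeholder value here).
model = BertForSequenceClassification.from_pretrained(
    "ctheodoris/Geneformer", num_labels=3
)

training_args = TrainingArguments(
    output_dir="geneformer_finetuned",
    learning_rate=5e-5,                # tune per task (see the hyperparameter note below)
    num_train_epochs=1,
    per_device_train_batch_size=12,
)

# tokenized_train / tokenized_eval would come from the tokenization and collation
# utilities shipped with the repository (see the tokenization sketch below).
# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=tokenized_train, eval_dataset=tokenized_eval)
# trainer.train()
```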
# Installation
In addition to the pretrained model, contained herein are functions for tokenizing and collating data specific to single cell transcriptomics, pretraining the model, fine-tuning the model, extracting and plotting cell embeddings, and performing in silico perturbation with either the pretrained or fine-tuned models. To install (~20s):

```bash
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/ctheodoris/Geneformer
cd Geneformer
pip install .
```

For usage, see [examples](https://huggingface.co/ctheodoris/Geneformer/tree/main/examples) for:
- tokenizing transcriptomes (a minimal sketch follows this list)
- pretraining
- hyperparameter tuning
- fine-tuning
- extracting and plotting cell embeddings
- in silico perturbation

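For instance, tokenizing a directory of transcriptome files might look roughly like the following. This is a minimal sketch patterned on the tokenization example; the metadata attribute names, directory paths, output prefix, and file format are placeholders to adapt to your own data.

```python
from geneformer import TranscriptomeTokenizer

# Map cell metadata columns in the input files to attribute names kept in the
# tokenized dataset (placeholder names), and set the number of worker processes.
tokenizer = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"},
                                   nproc=4)

# Tokenize all files of the given format in the input directory and write the
# tokenized dataset to the output directory (paths and prefix are placeholders).
tokenizer.tokenize_data("data/raw_transcriptomes",
                        "data/tokenized",
                        "my_dataset",
                        file_format="loom")
```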

Please note that the fine-tuning examples are meant to be generally applicable; the input datasets and labels will vary depending on the downstream task. Example input files for a few of the downstream tasks demonstrated in the manuscript are located in the [example_input_files directory](https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files) of the dataset repository, but these represent only a few example fine-tuning applications.

Please note that GPU resources are required for efficient usage of Geneformer. We also strongly recommend tuning hyperparameters (e.g. maximum learning rate, learning schedule, number of layers to freeze) for each downstream fine-tuning application, as this can significantly boost predictive potential in the downstream task.

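Continuing the fine-tuning sketch above, two of these knobs could be exercised roughly as follows; the frozen layer count and search space are placeholders, and the optuna backend for `Trainer.hyperparameter_search` is an assumption about your environment rather than a requirement of this repository.

```python
# Freeze the first N encoder layers of the BERT-style model (placeholder value).
N_FROZEN_LAYERS = 2
for layer in model.bert.encoder.layer[:N_FROZEN_LAYERS]:
    for param in layer.parameters():
        param.requires_grad = False

# Search over the maximum learning rate and epoch count with the transformers Trainer
# (requires optuna to be installed; `trainer` is the Trainer from the sketch above).
# def hp_space(trial):
#     return {
#         "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
#         "num_train_epochs": trial.suggest_int("num_train_epochs", 1, 3),
#     }
# best_run = trainer.hyperparameter_search(direction="maximize", backend="optuna",
#                                          hp_space=hp_space, n_trials=10)
```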
# Citations
- C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. _**Nature**_, 31 May 2023. (#co-corresponding authors)
- H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. _**bioRxiv**_, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)
# Geneformer
Geneformer is a foundational transformer model pretrained on a large-scale corpus of single cell transcriptomes to enable context-aware predictions in settings with limited data in network biology.

# Abstract
Mapping gene networks requires large amounts of transcriptomic data to learn the connections between genes, which impedes discoveries in settings with limited data, including rare diseases and diseases affecting clinically inaccessible tissues. Recently, transfer learning has revolutionized fields such as natural language understanding and computer vision by leveraging deep learning models pretrained on large-scale general datasets that can then be fine-tuned towards a vast array of downstream tasks with limited task-specific data. Here, we developed a context-aware, attention-based deep learning model, Geneformer, pretrained on a large-scale corpus of about 30 million single-cell transcriptomes to enable context-specific predictions in settings with limited data in network biology. During pretraining, Geneformer gained a fundamental understanding of network dynamics, encoding network hierarchy in the attention weights of the model in a completely self-supervised manner. Fine-tuning towards a diverse panel of downstream tasks relevant to chromatin and network dynamics using limited task-specific data demonstrated that Geneformer consistently boosted predictive accuracy. Applied to disease modelling with limited patient data, Geneformer identified candidate therapeutic targets for cardiomyopathy. Overall, Geneformer represents a pretrained deep learning model from which fine-tuning towards a broad range of downstream applications can be pursued to accelerate discovery of key network regulators and candidate therapeutic targets.

# Code
```python
from tdc.model_server.tokenizers.geneformer import GeneformerTokenizer
from tdc import tdc_hf_interface
import torch
# quant_layers is a small helper from the geneformer package that returns the number
# of transformer layers; the exact module path may vary across geneformer versions.
from geneformer.perturber_utils import quant_layers

# Load the pretrained Geneformer model via the TDC Hugging Face interface.
model = tdc_hf_interface("Geneformer").load()

# Retrieve an AnnData object as `adata`. Then tokenize the cell vectors:
tokenizer = GeneformerTokenizer()
x = tokenizer.tokenize_cell_vectors(adata,
                                    ensembl_id="feature_id",
                                    ncounts="n_measured_vars")
cells, _ = x

# Note that you may need to pad or perform other custom data processing here.
input_tensor = torch.tensor(cells)
# Build the attention mask; here we assume 0 was used as the padding token.
attention_mask = (input_tensor != 0).long()

outputs = model(input_tensor,
                attention_mask=attention_mask,
                output_hidden_states=True)

# Geneformer's second-to-last layer is most generalized.
layer_to_quant = quant_layers(model) - 1
embs_i = outputs.hidden_states[layer_to_quant]

# There are "cls", "cell", and "gene" embeddings. We only capture "gene" embeddings,
# which are cell-type specific; for "cell" embeddings, average across the unmasked
# gene embeddings of each cell.
embs = embs_i
```
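The closing comment describes how per-cell embeddings would be obtained from the gene-level embeddings. A minimal sketch of that masked averaging, reusing the `embs` and `attention_mask` tensors from the block above, might look like this:

```python
import torch

def mean_cell_embeddings(embs: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average gene-level embeddings over unmasked positions to get one vector per cell."""
    mask = attention_mask.unsqueeze(-1).to(embs.dtype)  # (batch, seq_len, 1)
    summed = (embs * mask).sum(dim=1)                   # sum embeddings of real (unpadded) genes
    counts = mask.sum(dim=1).clamp(min=1)               # number of unmasked genes per cell
    return summed / counts                              # (batch, hidden_size) per-cell embeddings

# cell_embs = mean_cell_embeddings(embs, attention_mask)
```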
# Citations
- A Velez-Arce#, K Huang, M M Li, X Lin, W Gao, T Fu, M Kellis, B L Pentelute, M Zitnik#. Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics. _**bioRxiv**_, 12 Nov 2024. (#co-corresponding authors)
- C V Theodoris#, L Xiao, A Chopra, M D Chaffin, Z R Al Sayed, M C Hill, H Mantineo, E Brydon, Z Zeng, X S Liu, P T Ellinor#. Transfer learning enables predictions in network biology. _**Nature**_, 31 May 2023. (#co-corresponding authors)
- H Chen*, M S Venkatesh*, J Gomez Ortega, S V Mahesh, T Nandi, R Madduri, K Pelka†, C V Theodoris†#. Quantized multi-task learning for context-specific representations of gene network dynamics. _**bioRxiv**_, 19 Aug 2024. (*co-first authors, †co-senior authors, #corresponding author)