jkobject committed on
Commit
905237e
·
verified ·
1 Parent(s): d6daa62

Update README.md

  1. README.md +183 -1
README.md CHANGED
@@ -2,4 +2,186 @@
2
  license: apache-2.0
3
  tags:
4
  - biology
5
- ---
2
  license: apache-2.0
3
  tags:
4
  - biology
5
+ ---
6
+
7
+
8
+ # scPRINT: Large Cell Model for scRNAseq data
9
+
10
+ [![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
11
+ [![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
12
+ [![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
13
+ [![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
14
+ [![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
15
+ [![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
16
+ [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
17
+ [![DOI](https://zenodo.org/badge/391909874.svg)]()
18
+
19
+ ![logo](logo.png)
20
+
21
+ scPRINT is a large transformer model built for the inference of gene networks (connections between genes that explain the cell's expression profile) from scRNAseq data.
22
+
23
+ It uses novel encoding and decoding of the cell expression profile as well as new pre-training methodologies to learn a cell model.
24
+
25
+ scPRINT can be used for several tasks:
26
+
27
+ - __expression denoising__: increase the resolution of your scRNAseq data
28
+ - __cell embedding__: generate a low-dimensional representation of your dataset
29
+ - __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
30
+ - __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset
31
+
32
+ [Read the paper!]() if you want to know more about scPRINT.
33
+
34
+ ![figure1](figure1.png)
35
+
36
+ ## Install it from PyPI
37
+
38
+ If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.
39
+
40
+ 👷 WIP ...
41
+
42
+ <!---
43
+
44
+ ```bash
45
+ pip install 'lamindb[jupyter,bionty]'
46
+ ```
47
+
48
+ then install scPrint
49
+
50
+ ```bash
51
+ pip install scprint
52
+ ```
53
+ > If you have a GPU that you want to use, you will benefit from flashattention, and you will need some more specific installs:
54
+
55
+ 1. find the version of torch 2.0.0 / torchvision 0.15.0 / torchaudio 2.0.0 that match your nvidia drivers on the torch website.
56
+ 2. apply the install command
57
+ 3. do `pip install pytorch-fast-transformers torchtext==0.15.1`
58
+ 4. do `pip install triton==2.0.0.dev20221202 --no-deps`
59
+
60
+ You should be good to go. You need those specific versions for everything to work...
61
+
62
+ This is not my fault, scream at nvidia :wink:
63
+ -->
64
+
65
+ ## Install it in dev mode
66
+
67
+ For the moment, scPRINT has been tested on macOS and Linux (Ubuntu 20.04) with Python 3.10.
68
+
69
+ If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.
70
+
71
+
72
+ ```bash
73
+ conda create -n "[whatever]" python==3.10
74
+ git clone https://github.com/jkobject/scPRINT
75
+ git clone https://github.com/jkobject/GRnnData
76
+ git clone https://github.com/jkobject/benGRN
77
+ cd scPRINT
78
+ git submodule init
79
+ git submodule update
80
+ pip install 'lamindb[jupyter,bionty]'
81
+ pip install -e scDataloader
82
+ pip install -e ../GRnnData/
83
+ pip install -e ../benGRN/
84
+ pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
85
+ # install the dev tooling if you need it too
86
+ pip install -e ".[dev]"
87
+ pip install -r requirements-dev.txt
88
+ pip install triton==2.0.0.dev20221202 --no-deps # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
89
+ # install triton as mentioned in .toml if you want to
90
+ mkdocs serve # to view the dev documentation
91
+ ```
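
The pins above (Python 3.10, torch==2.0.0) are easy to get wrong, so a quick check after installing can help. The following is only a sketch and not part of scPRINT: the helper name `check_environment` and its `pins` argument are hypothetical, and it relies only on the standard library's `importlib.metadata`.

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 10), pins=None):
    """Report whether the interpreter and pinned packages match the README.

    Hypothetical helper for illustration; scPRINT does not ship it.
    """
    # Default pin copied from the install commands above; adjust if they change.
    pins = pins if pins is not None else {"torch": "2.0.0"}
    report = {"python_ok": sys.version_info[:2] == expected_python}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu117" are ignored in the comparison.
        report[name] = {
            "installed": installed,
            "ok": installed is not None and installed.split("+")[0] == wanted,
        }
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(key, value)
```

Run it inside the conda environment created above. A package that is missing is reported with `installed: None` instead of raising an error.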
92
+
93
+ We use additional packages that we developed; refer to their documentation for more information:
94
+
95
+ - [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
96
+ - [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
97
+ - [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.
98
+
99
+ ### lamin.ai
100
+
101
+ ⚠️ To use scDataLoader's multi-dataset mode, to preprocess datasets, or to use some other functions of the model, you will need lamin.ai.
102
+
103
+ In that case, sign in with Google or GitHub at [lamin.ai](https://lamin.ai/login), then be sure to log in before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).
104
+
105
+ ## Usage
106
+
107
+ ### scPRINT's basic commands
108
+
109
+ This is a minimal example of how scPRINT is used:
110
+
111
+ ```py
112
+ from lightning.pytorch import Trainer
113
+ from scprint import scPrint
114
+ from scdataloader import DataModule
115
+
116
+ datamodule = DataModule(...)
117
+ model = scPrint(...)
118
+ trainer = Trainer(...)
119
+ trainer.fit(model, datamodule=datamodule)
120
+ ...
121
+ ```
122
+
123
+ or
124
+
125
+ ```bash
126
+ $ scprint fit/train/predict/test --config config/[medium|large|vlarge] ...
127
+ ```
128
+
129
+ ### Notes on GPU/CPU usage with triton
130
+
131
+ If you do not have [triton](https://triton-lang.org/main/python-api/triton.html) installed, you will not be able to take advantage of GPU acceleration, but you can still use the model on the CPU.
132
+
133
+ In that case, if loading from a checkpoint that was trained with flashattention, you will need to specify `transformer="normal"` in the `load_from_checkpoint` function like so:
134
+
135
+ ```python
136
+ model = scPrint.load_from_checkpoint(
137
+ '../data/temp/last.ckpt', precpt_gene_emb=None,
138
+ transformer="normal")
139
+ ```
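
To avoid hard-coding this choice, the fallback can be selected at runtime depending on whether triton is importable. This is a minimal sketch, not part of scPRINT: the helper `checkpoint_kwargs` is hypothetical, and the only scPRINT-specific value it uses is `transformer="normal"` from the example above.

```python
from importlib.util import find_spec


def checkpoint_kwargs(triton_available=None):
    """Return extra keyword arguments for loading a checkpoint.

    Hypothetical helper for illustration; scPRINT does not ship it.
    """
    if triton_available is None:
        # Detect triton without importing it.
        triton_available = find_spec("triton") is not None
    # Without triton, request the standard attention implementation.
    return {} if triton_available else {"transformer": "normal"}


# Intended use, assuming scPrint is imported as in the example above:
# model = scPrint.load_from_checkpoint(
#     '../data/temp/last.ckpt', precpt_gene_emb=None, **checkpoint_kwargs())
```

When triton is found, the helper returns an empty dict so the checkpoint's own setting is left untouched.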
140
+
141
+ We now explore the different usages of scPRINT:
142
+
143
+ ### I want to generate gene networks from scRNAseq data:
144
+
145
+ -> refer to the section 1. gene network inference in [this notebook](./notebooks/cancer_usecase.ipynb#).
146
+
147
+ -> more examples in this notebook [./notebooks/assessments/bench_omni.ipynb](./notebooks/assessments/bench_omni.ipynb).
148
+
149
+ ### I want to generate cell embeddings and cell label predictions from scRNAseq data:
150
+
151
+ -> refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb).
152
+
153
+ ### I want to denoise my scRNAseq dataset:
154
+
155
+ -> refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).
156
+
157
+ -> more examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](./notebooks/assessments/bench_denoising.ipynb).
158
+
159
+ ### I want to generate an atlas level embedding
160
+
161
+ -> refer to the notebook [nice_umap.ipynb](./figures/nice_umap.ipynb).
162
+
163
+ ### Documentation
164
+
165
+ /!\ WIP /!\
166
+
167
+ <!--
168
+ for more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)
169
+
170
+ -->
171
+
172
+ ### Model Weights
173
+
174
+ Model weights are available on [Hugging Face](https://huggingface.co/jkobject).
175
+
176
+ ## Development
177
+
178
+ Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
179
+
180
+ Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to learn more about how training was performed and what the results were.
181
+
182
+ acknowledgement:
183
+ [python template](https://github.com/rochacbruno/python-project-template)
184
+ [laminDB](https://lamin.ai/)
185
+ [lightning](https://lightning.ai/)
186
+
187
+ Awesome Large Cell Model created by Jeremie Kalfon.