Update README.md
license: apache-2.0
tags:
- biology
---

# scPRINT: Large Cell Model for scRNAseq data

[![PyPI version](https://badge.fury.io/py/scprint.svg)](https://badge.fury.io/py/scprint)
[![Documentation Status](https://readthedocs.org/projects/scprint/badge/?version=latest)](https://scprint.readthedocs.io/en/latest/?badge=latest)
[![Downloads](https://pepy.tech/badge/scprint)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/month)](https://pepy.tech/project/scprint)
[![Downloads](https://pepy.tech/badge/scprint/week)](https://pepy.tech/project/scprint)
[![GitHub issues](https://img.shields.io/github/issues/jkobject/scPRINT)](https://img.shields.io/github/issues/jkobject/scPRINT)
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
[![DOI](https://zenodo.org/badge/391909874.svg)]()

![logo](logo.png)

scPRINT is a large transformer model built for the inference of gene networks (connections between genes that explain the cell's expression profile) from scRNAseq data.

It uses novel encoding and decoding of the cell expression profile, as well as new pre-training methodologies, to learn a cell model.

scPRINT can be used for several tasks:

- __expression denoising__: increase the resolution of your scRNAseq data
- __cell embedding__: generate a low-dimensional representation of your dataset
- __label prediction__: predict the cell type, disease, sequencer, sex, and ethnicity of your cells
- __gene network inference__: generate a gene network from any cell or cell cluster in your scRNAseq dataset

[Read the paper!]() if you want to know more about scPRINT.

![figure1](figure1.png)

## Install it from PyPI

If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.

👷 WIP ...

<!---

```bash
pip install 'lamindb[jupyter,bionty]'
```

then install scPRINT

```bash
pip install scprint
```
> If you have a GPU that you want to use, you will benefit from flashattention, and you will need some more specific installs:

1. find the versions of torch 2.0.0 / torchvision 0.15.0 / torchaudio 2.0.0 that match your nvidia drivers on the torch website.
2. apply the install command
3. do `pip install pytorch-fast-transformers torchtext==0.15.1`
4. do `pip install triton==2.0.0.dev20221202 --no-deps`

You should be good to go. You need those specific versions for everything to work...

This is not my fault, scream at nvidia :wink:
-->

## Install it in dev mode

For the moment scPRINT has been tested on MacOS and Linux (Ubuntu 20.04) with Python 3.10.

If you want to use flashattention2, note that for now it only supports the MLIR version of triton 2.0 and torch==2.0.0.


```bash
conda create -n "[whatever]" python==3.10
git clone https://github.com/jkobject/scPRINT
git clone https://github.com/jkobject/GRnnData
git clone https://github.com/jkobject/benGRN
cd scPRINT
git submodule init
git submodule update
pip install 'lamindb[jupyter,bionty]'
pip install -e scDataloader
pip install -e ../GRnnData/
pip install -e ../benGRN/
pip install torch==2.0.0 torchvision==0.15.1 torchaudio==2.0.1
# install the dev tooling if you need it too
pip install -e ".[dev]"
pip install -r requirements-dev.txt
pip install triton==2.0.0.dev20221202 --no-deps # only if you have a compatible gpu (e.g. not available for apple GPUs for now, see https://github.com/triton-lang/triton?tab=readme-ov-file#compatibility)
# install triton as mentioned in .toml if you want to
mkdocs serve # to view the dev documentation
```

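Since the install depends on a specific set of pinned versions, it can help to confirm what actually ended up in the environment. The following is a minimal sketch using only the standard library; the names `EXPECTED` and `check_pins` are illustrative and not part of scPRINT, and the pins are simply copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above; adjust them if the project changes its requirements.
EXPECTED = {
    "torch": "2.0.0",
    "torchvision": "0.15.1",
    "torchaudio": "2.0.1",
}


def check_pins(expected, get_version=metadata.version):
    """Return {package: (installed_or_None, wanted, ok)} for each pinned package."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "+cu118" are ignored for the comparison.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, wanted, ok)
    return report


if __name__ == "__main__":
    for name, (installed, wanted, ok) in check_pins(EXPECTED).items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={wanted} [{status}]")
```

Run it inside the conda environment created above; a `MISMATCH` line points at a package to reinstall.
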
We use additional packages that we developed; refer to their documentation for more information:

- [scDataLoader](https://github.com/jkobject/scDataLoader): a dataloader for training large cell models.
- [GRnnData](https://github.com/cantinilab/GRnnData): a package to work with gene networks from single cell data.
- [benGRN](https://github.com/jkobject/benGRN): a package to benchmark gene network inference methods from single cell data.

### lamin.ai

⚠️ If you want to use scDataLoader's multi-dataset mode, or if you want to preprocess datasets and use other functions of the model, you will need to use lamin.ai.

In that case, connect with Google or GitHub to [lamin.ai](https://lamin.ai/login), then be sure to log in before running anything (or before starting a notebook): `lamin login <email> --key <API-key>`. Follow the instructions on [their website](https://docs.lamin.ai/guide).

## Usage

### scPRINT's basic commands

This is the most minimal example of how scPRINT is used:

```py
from lightning.pytorch import Trainer
from scprint import scPrint
from scdataloader import DataModule

# constructor arguments are left elided here
datamodule = DataModule(...)
model = scPrint(...)
trainer = Trainer(...)
trainer.fit(model, datamodule=datamodule)
...
```

or

```bash
$ scprint fit/train/predict/test --config config/[medium|large|vlarge] ...
```

### Notes on GPU/CPU usage with triton

If you do not have [triton](https://triton-lang.org/main/python-api/triton.html) installed, you will not be able to take advantage of GPU acceleration, but you can still use the model on the CPU.

In that case, if you load a checkpoint that was trained with flashattention, you will need to specify `transformer="normal"` in the `load_from_checkpoint` function, like so:

```python
from scprint import scPrint

# load a flashattention-trained checkpoint with the standard attention implementation
model = scPrint.load_from_checkpoint(
    '../data/temp/last.ckpt', precpt_gene_emb=None,
    transformer="normal")
```

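The choice described above can also be made at runtime. The sketch below is only an illustration: `checkpoint_kwargs` is a hypothetical helper, not part of scPRINT, and it assumes that being able to import `triton` is a good enough proxy for GPU support (an importable triton does not guarantee a compatible GPU). The `precpt_gene_emb=None` and `transformer="normal"` arguments are taken from the example above.

```python
from importlib.util import find_spec


def checkpoint_kwargs(triton_available=None):
    """Build keyword arguments for scPrint.load_from_checkpoint (hypothetical helper)."""
    if triton_available is None:
        # find_spec returns None when the package cannot be found in this environment.
        triton_available = find_spec("triton") is not None
    kwargs = {"precpt_gene_emb": None}
    if not triton_available:
        # Without triton, request the standard attention implementation.
        kwargs["transformer"] = "normal"
    return kwargs


if __name__ == "__main__":
    print(checkpoint_kwargs())
    # Possible usage, reusing the checkpoint path from the example above:
    # model = scPrint.load_from_checkpoint("../data/temp/last.ckpt", **checkpoint_kwargs())
```

When triton is importable the helper leaves `transformer` unset, so the checkpoint's own setting is used.
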
We now explore the different usages of scPRINT:

### I want to generate gene networks from scRNAseq data:

-> refer to the section 1. gene network inference in [this notebook](./notebooks/cancer_usecase.ipynb#).

-> more examples in this notebook [./notebooks/assessments/bench_omni.ipynb](./notebooks/assessments/bench_omni.ipynb).

### I want to generate cell embeddings and cell label predictions from scRNAseq data:

-> refer to the embeddings and cell annotations section in [this notebook](./notebooks/cancer_usecase.ipynb).

### I want to denoise my scRNAseq dataset:

-> refer to the Denoising of B-cell section in [this notebook](./notebooks/cancer_usecase.ipynb).

-> more examples in our benchmark notebook [./notebooks/assessments/bench_denoising.ipynb](./notebooks/assessments/bench_denoising.ipynb).

### I want to generate an atlas level embedding

-> refer to the notebook [nice_umap.ipynb](./figures/nice_umap.ipynb).

### Documentation

/!\ WIP /!\

<!--
for more information on usage please see the documentation in [https://www.jkobject.com/scPrint/](https://www.jkobject.com/scPrint/)

-->

### Model Weights

Model weights are available on [Hugging Face](https://huggingface.co/jkobject).

## Development

Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.

Read the [training runs](https://wandb.ai/ml4ig/scprint_scale/reports/scPRINT-trainings--Vmlldzo4ODIxMjgx?accessToken=80metwx7b08hhourotpskdyaxiflq700xzmzymr6scvkp69agybt79l341tv68hp) document to learn more about how training was performed and the results obtained.

Acknowledgements:

- [python template](https://github.com/rochacbruno/python-project-template)
- [laminDB](https://lamin.ai/)
- [lightning](https://lightning.ai/)

Awesome Large Cell Model created by Jeremie Kalfon.