#!/usr/bin/env python
# coding: utf-8
# # Data integration and batch correction with SIMBA
#
# Here we use three scRNA-seq human pancreas datasets from different studies to illustrate how SIMBA performs scRNA-seq batch correction across multiple batches.
#
# We follow the corresponding tutorial at [SIMBA](https://simba-bio.readthedocs.io/en/latest/rna_human_pancreas.html). We do not provide much explanation, and instead refer to the original tutorial.
#
# Paper: [SIMBA: single-cell embedding along with features](https://www.nature.com/articles/s41592-023-01899-8)
#
# Code: https://github.com/huidongchen/simba
# In[1]:
import omicverse as ov
from omicverse.utils import mde
workdir = 'result_human_pancreas'
ov.utils.ov_plot_set()
# SIMBA must be installed first:
#
# ```
# conda install -c bioconda simba
# ```
#
# or
#
# ```
# pip install git+https://github.com/huidongchen/simba
# pip install git+https://github.com/pinellolab/simba_pbg
# ```
# ## Read data
#
# The AnnData object was concatenated from three datasets shipped with SIMBA: `simba.datasets.rna_baron2016()`, `simba.datasets.rna_segerstolpe2016()`, and `simba.datasets.rna_muraro2016()`.
#
# It can be downloaded from figshare: https://figshare.com/ndownloader/files/41418600
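#
# A minimal sketch of fetching the file only when it is not already on disk. The helper name `fetch_if_missing` is hypothetical (it is not part of omicverse or SIMBA), and the destination filename is assumed to match the one read in the next cell.
# In[ ]:
import os
import urllib.request

def fetch_if_missing(url, dest):
    """Download `url` to `dest` unless `dest` already exists; return `dest`."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest

# Example usage (commented out so the script does not touch the network unless asked):
# fetch_if_missing('https://figshare.com/ndownloader/files/41418600', 'simba_adata_raw.h5ad')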
# In[2]:
adata=ov.utils.read('simba_adata_raw.h5ad')
# A working directory (`workdir`) is required to initialize the pySIMBA object.
# In[3]:
simba_object=ov.single.pySIMBA(adata,workdir)
# ## Preprocess
#
# Following the original tutorial, we keep the default parameters.
# In[4]:
simba_object.preprocess(batch_key='batch', min_n_cells=3,
                        method='lib_size', n_top_genes=3000, n_bins=5)
# ## Generate a graph for training
#
# Observations and variables within each Anndata object are both represented as nodes (entities).
#
# The resulting data are stored in `simba_object.uns['simba_batch_edge_dict']`.
# In[5]:
simba_object.gen_graph()
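# To make the node/edge idea concrete, here is a toy sketch (not SIMBA's actual implementation) that turns a small cell-by-gene matrix into (cell, relation, gene) edges. Nonzero values are split into `n_bins` levels and each level becomes its own relation type. The helper name `toy_edges`, the relation names `r0`, `r1`, ..., and the equal-width binning are all assumptions made for illustration.
# In[ ]:
import numpy as np

def toy_edges(X, n_bins=5):
    """Return (cell_index, relation_name, gene_index) triples for the nonzero entries of X."""
    X = np.asarray(X, dtype=float)
    rows, cols = np.nonzero(X)
    vals = X[rows, cols]
    if vals.size == 0:
        return []
    # Equal-width bin edges over the nonzero values (assumption; SIMBA's own binning may differ)
    bin_edges = np.linspace(vals.min(), vals.max(), n_bins + 1)
    # Use interior edges only, so levels run from 0 to n_bins - 1
    levels = np.digitize(vals, bin_edges[1:-1])
    return [(int(r), f'r{int(l)}', int(c)) for r, l, c in zip(rows, levels, cols)]

# Two cells, three genes: each nonzero entry becomes one edge
toy_X = np.array([[0, 1, 5],
                  [2, 0, 0]])
toy_edge_list = toy_edges(toy_X, n_bins=5)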
# ## PBG training
#
# Before training, let’s take a look at the current parameters:
#
# - `dict_config['workers'] = 12`: the number of CPUs. In the cell below we pass `num_workers=6` instead.
# In[10]:
simba_object.train(num_workers=6)
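# A small sketch for deriving the worker count from the machine instead of hard-coding it. Using half of the available cores is an arbitrary assumption, not a SIMBA or omicverse recommendation.
# In[ ]:
import os

n_cpus = os.cpu_count() or 1      # os.cpu_count() may return None
n_workers = max(1, n_cpus // 2)   # always keep at least one worker
# Example usage (commented out to avoid retraining):
# simba_object.train(num_workers=n_workers)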
# In[6]:
simba_object.load('result_human_pancreas/pbg/graph0')
# ## Batch correction
#
# Here, we use `simba_object.batch_correction()` to perform the batch correction
#
# <div class="admonition note">
# <p class="admonition-title">Note</p>
# <p>
# If there are more than 10 batches, batch correction is less effective.
# </p>
# </div>
# In[7]:
adata=simba_object.batch_correction()
adata
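# As a rough sanity check on the corrected embedding, the sketch below computes the mean fraction of each cell's k nearest neighbours that belong to a different batch. Values near 0 suggest batches remain separated; higher values suggest more mixing. The function `batch_mixing_score` is a hypothetical helper written for this illustration (it is not part of omicverse or SIMBA), and it builds a dense distance matrix, so it should only be run on a subset of cells.
# In[ ]:
import numpy as np

def batch_mixing_score(embedding, batches, k=10):
    """Mean fraction of each cell's k nearest neighbours that come from a different batch."""
    emb = np.asarray(embedding, dtype=float)
    batches = np.asarray(batches)
    # Pairwise squared Euclidean distances (O(n^2) memory, so only for small inputs)
    sq = (emb ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T
    np.fill_diagonal(d2, np.inf)          # exclude each cell from its own neighbours
    nn = np.argsort(d2, axis=1)[:, :k]    # indices of the k nearest neighbours
    return float((batches[nn] != batches[:, None]).mean())

# Example usage on a random subset of cells (commented out; assumes `adata` from the cell above):
# idx = np.random.default_rng(0).choice(adata.n_obs, size=min(2000, adata.n_obs), replace=False)
# batch_mixing_score(adata.obsm['X_simba'][idx], adata.obs['batch'].to_numpy()[idx])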
# ## Visualize
#
# We use `mde` instead of `umap` to visualize the result.
# In[8]:
adata.obsm["X_mde"] = mde(adata.obsm["X_simba"])
# In[11]:
import scanpy as sc  # needed for sc.pl.embedding
sc.pl.embedding(adata, basis='X_mde', color=['cell_type1', 'batch'])
# UMAP can also be used for visualization.
# In[10]:
import scanpy as sc
sc.pp.neighbors(adata, use_rep="X_simba")
sc.tl.umap(adata)
sc.pl.umap(adata,color=['cell_type1','batch'])