Spaces:
Sleeping
Sleeping
#!/usr/bin/env python | |
# coding: utf-8 | |
# # Different Expression Analysis | |
# | |
# An important task of bulk rna-seq analysis is the different expression , which we can perform with omicverse. For different expression analysis, ov change the `gene_id` to `gene_name` of matrix first. When our dataset existed the batch effect, we can use the SizeFactors of DEseq2 to normalize it, and use `t-test` of `wilcoxon` to calculate the p-value of genes. Here we demonstrate this pipeline with a matrix from `featureCounts`. The same pipeline would generally be used to analyze any collection of RNA-seq tasks. | |
# | |
# Colab_Reproducibility:https://colab.research.google.com/drive/1q5lDfJepbtvNtc1TKz-h4wGUifTZ3i0_?usp=sharing | |
# In[1]: | |
import omicverse as ov | |
import scanpy as sc | |
import matplotlib.pyplot as plt | |
ov.plot_set() | |
# ## Geneset Download | |
# | |
# When we need to convert a gene id, we need to prepare a mapping pair file. Here we have pre-processed 6 genome gtf files and generated mapping pairs including `T2T-CHM13`, `GRCh38`, `GRCh37`, `GRCm39`, `danRer7`, and `danRer11`. If you need to convert other id_mapping, you can generate your own mapping using gtf Place the files in the `genesets` directory. | |
# In[2]: | |
ov.utils.download_geneid_annotation_pair() | |
# Note that this dataset has not been processed in any way and is only exported by `featureCounts`, and Sequence alignment was performed from the genome file of CRCm39 | |
# | |
# sample data can be download from: https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt | |
# In[3]: | |
#data=pd.read_csv('https://raw.githubusercontent.com/Starlitnightly/omicverse/master/sample/counts.txt',index_col=0,sep='\t',header=1) | |
data=ov.read('data/counts.txt',index_col=0,header=1) | |
#replace the columns `.bam` to `` | |
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns] | |
data.head() | |
# ## ID mapping | |
# | |
# We performed the gene_id mapping by the mapping pair file `GRCm39` downloaded before. | |
# In[4]: | |
data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv') | |
data.head() | |
# ## Different expression analysis with ov | |
# | |
# We can do differential expression analysis very simply by ov, simply by providing an expression matrix. To run DEG, we simply need to: | |
# | |
# - Read the raw count by featureCount or any other qualify methods. | |
# - Create an ov DEseq object. | |
# In[5]: | |
dds=ov.bulk.pyDEG(data) | |
# We notes that the gene_name mapping before exist some duplicates, we will process the duplicate indexes to retain only the highest expressed genes | |
# In[6]: | |
dds.drop_duplicates_index() | |
print('... drop_duplicates_index success') | |
# We also need to remove the batch effect of the expression matrix, `estimateSizeFactors` of DEseq2 to be used to normalize our matrix | |
# In[7]: | |
dds.normalize() | |
print('... estimateSizeFactors and normalize success') | |
# Now we can calculate the different expression gene from matrix, we need to input the treatment and control groups | |
# In[8]: | |
treatment_groups=['4-3','4-4'] | |
control_groups=['1--1','1--2'] | |
result=dds.deg_analysis(treatment_groups,control_groups,method='ttest') | |
result.head() | |
# One important thing is that we do not filter out low expression genes when processing DEGs, and in future versions I will consider building in the corresponding processing. | |
# In[9]: | |
print(result.shape) | |
result=result.loc[result['log2(BaseMean)']>1] | |
print(result.shape) | |
# We also need to set the threshold of Foldchange, we prepare a method named `foldchange_set` to finish. This function automatically calculates the appropriate threshold based on the log2FC distribution, but you can also enter it manually. | |
# In[10]: | |
# -1 means automatically calculates | |
dds.foldchange_set(fc_threshold=-1, | |
pval_threshold=0.05, | |
logp_max=6) | |
# ## Visualize the DEG result and specific genes | |
# | |
# To visualize the DEG result, we use `plot_volcano` to do it. This fuction can visualize the gene interested or high different expression genes. There are some parameters you need to input: | |
# | |
# - title: The title of volcano | |
# - figsize: The size of figure | |
# - plot_genes: The genes you interested | |
# - plot_genes_num: If you don't have interested genes, you can auto plot it. | |
# In[11]: | |
dds.plot_volcano(title='DEG Analysis',figsize=(4,4), | |
plot_genes_num=8,plot_genes_fontsize=12,) | |
# To visualize the specific genes, we only need to use the `dds.plot_boxplot` function to finish it. | |
# In[12]: | |
dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups, | |
control_groups=control_groups,figsize=(2,3),fontsize=12, | |
legend_bbox=(2,0.55)) | |
# In[13]: | |
dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups, | |
control_groups=control_groups,figsize=(2,3),fontsize=12, | |
legend_bbox=(2,0.55)) | |
# ## Pathway enrichment analysis by ov | |
# | |
# Here we use the `gseapy` package, which included the GSEA analysis and Enrichment. We have optimised the output of the package and given some better looking graph drawing functions | |
# | |
# Similarly, we need to download the pathway/genesets first. Five genesets we prepare previously, you can use `ov.utils.download_pathway_database()` to download automatically. Besides, you can download the pathway you interested from enrichr: https://maayanlab.cloud/Enrichr/#libraries | |
# In[37]: | |
ov.utils.download_pathway_database() | |
# In[14]: | |
pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse') | |
# Note that the `pvalue_type` we set to `auto`, this is because when the genesets we enrichment if too small, use the `adjusted pvalue` we can't get the correct result. So you can set `adjust` or `raw` to get the significant geneset. | |
# | |
# If you didn't have internet, please set `background` to all genes expressed in rna-seq,like: | |
# | |
# ```python | |
# enr=ov.bulk.geneset_enrichment(gene_list=deg_genes, | |
# pathways_dict=pathway_dict, | |
# pvalue_type='auto', | |
# background=dds.result.index.tolist(), | |
# organism='mouse') | |
# ``` | |
# In[15]: | |
deg_genes=dds.result.loc[dds.result['sig']!='normal'].index.tolist() | |
enr=ov.bulk.geneset_enrichment(gene_list=deg_genes, | |
pathways_dict=pathway_dict, | |
pvalue_type='auto', | |
organism='mouse') | |
# To visualize the enrichment, we use `geneset_plot` to finish it | |
# In[21]: | |
ov.bulk.geneset_plot(enr,figsize=(2,5),fig_title='Wiki Pathway enrichment', | |
cax_loc=[2, 0.45, 0.5, 0.02], | |
bbox_to_anchor_used=(-0.25, -13),node_diameter=10, | |
custom_ticks=[5,7],text_knock=3, | |
cmap='Reds') | |
# ## Multi pathway enrichment | |
# | |
# In addition to pathway enrichment for a single database, OmicVerse supports enriching and visualizing multiple pathways at the same time, which is implemented using [`pyComplexHeatmap`](https://dingwb.github.io/PyComplexHeatmap/build/html/notebooks/gene_enrichment_analysis.html), and Citetation is welcome! | |
# In[22]: | |
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Biological_Process_2023.txt',organism='Mouse') | |
enr_go_bp=ov.bulk.geneset_enrichment(gene_list=deg_genes, | |
pathways_dict=pathway_dict, | |
pvalue_type='auto', | |
organism='mouse') | |
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Molecular_Function_2023.txt',organism='Mouse') | |
enr_go_mf=ov.bulk.geneset_enrichment(gene_list=deg_genes, | |
pathways_dict=pathway_dict, | |
pvalue_type='auto', | |
organism='mouse') | |
pathway_dict=ov.utils.geneset_prepare('genesets/GO_Cellular_Component_2023.txt',organism='Mouse') | |
enr_go_cc=ov.bulk.geneset_enrichment(gene_list=deg_genes, | |
pathways_dict=pathway_dict, | |
pvalue_type='auto', | |
organism='mouse') | |
# In[167]: | |
enr_dict={'BP':enr_go_bp, | |
'MF':enr_go_mf, | |
'CC':enr_go_cc} | |
colors_dict={ | |
'BP':ov.pl.red_color[1], | |
'MF':ov.pl.green_color[1], | |
'CC':ov.pl.blue_color[1], | |
} | |
ov.bulk.geneset_plot_multi(enr_dict,colors_dict,num=3, | |
figsize=(2,5), | |
text_knock=3,fontsize=8, | |
cmap='Reds' | |
) | |
# In[166]: | |
def geneset_plot_multi(enr_dict,colors_dict,num:int=5,fontsize=10, | |
fig_title:str='',fig_xlabel:str='Fractions of genes', | |
figsize:tuple=(2,4),cmap:str='YlGnBu', | |
text_knock:int=5,text_maxsize:int=20,ax=None, | |
): | |
from PyComplexHeatmap import HeatmapAnnotation,DotClustermapPlotter,anno_label,anno_simple,AnnotationBase | |
for key in enr_dict.keys(): | |
enr_dict[key]['Type']=key | |
enr_all=pd.concat([enr_dict[i].iloc[:num] for i in enr_dict.keys()],axis=0) | |
enr_all['Term']=[ov.utils.plot_text_set(i.split('(')[0],text_knock=text_knock,text_maxsize=text_maxsize) for i in enr_all.Term.tolist()] | |
enr_all.index=enr_all.Term | |
enr_all['Term1']=[i for i in enr_all.index.tolist()] | |
del enr_all['Term'] | |
colors=colors_dict | |
left_ha = HeatmapAnnotation( | |
label=anno_label(enr_all.Type, merge=True,rotation=0,colors=colors,relpos=(1,0.8)), | |
Category=anno_simple(enr_all.Type,cmap='Set1', | |
add_text=False,legend=False,colors=colors), | |
axis=0,verbose=0,label_kws={'rotation':45,'horizontalalignment':'left','visible':False}) | |
right_ha = HeatmapAnnotation( | |
label=anno_label(enr_all.Term1, merge=True,rotation=0,relpos=(0,0.5),arrowprops=dict(visible=True), | |
colors=enr_all.assign(color=enr_all.Type.map(colors)).set_index('Term1').color.to_dict(), | |
fontsize=fontsize,luminance=0.8,height=2), | |
axis=0,verbose=0,#label_kws={'rotation':45,'horizontalalignment':'left'}, | |
orientation='right') | |
if ax==None: | |
fig, ax = plt.subplots(figsize=figsize) | |
else: | |
ax=ax | |
#plt.figure(figsize=figsize) | |
cm = DotClustermapPlotter(data=enr_all, x='fraction',y='Term1',value='logp',c='logp',s='num', | |
cmap=cmap, | |
row_cluster=True,#col_cluster=True,#hue='Group', | |
#cmap={'Group1':'Greens','Group2':'OrRd'}, | |
vmin=-1*np.log10(0.1),vmax=-1*np.log10(1e-10), | |
#colors={'Group1':'yellowgreen','Group2':'orange'}, | |
#marker={'Group1':'*','Group2':'$\\ast$'}, | |
show_rownames=True,show_colnames=False,row_dendrogram=False, | |
col_names_side='top',row_names_side='right', | |
xticklabels_kws={'labelrotation': 30, 'labelcolor': 'blue','labelsize':fontsize}, | |
#yticklabels_kws={'labelsize':10}, | |
#top_annotation=col_ha,left_annotation=left_ha,right_annotation=right_ha, | |
left_annotation=left_ha,right_annotation=right_ha, | |
spines=False, | |
row_split=enr_all.Type,# row_split_gap=1, | |
#col_split=df_col.Group,col_split_gap=0.5, | |
verbose=1,legend_gap=10, | |
#dot_legend_marker='*', | |
xlabel='Fractions of genes',xlabel_side="bottom", | |
xlabel_kws=dict(labelpad=8,fontweight='normal',fontsize=fontsize+2), | |
# xlabel_bbox_kws=dict(facecolor=facecolor) | |
) | |
tesr=plt.gcf().axes | |
for ax in plt.gcf().axes: | |
if hasattr(ax, 'get_xlabel'): | |
if ax.get_xlabel() == 'Fractions of genes': # 假设 colorbar 有一个特定的标签 | |
cbar = ax | |
cbar.grid(False) | |
if ax.get_ylabel() == 'logp': # 假设 colorbar 有一个特定的标签 | |
cbar = ax | |
cbar.tick_params(labelsize=fontsize+2) | |
cbar.set_ylabel(r'$−Log_{10}(P_{adjusted})$',fontsize=fontsize+2) | |
cbar.grid(False) | |
return ax | |
# In[ ]: | |