#!/usr/bin/env python
# coding: utf-8
# # Differential Expression Analysis with DESeq2
#
# An important task in bulk RNA-seq analysis is differential expression analysis, which we can perform with omicverse. Before the analysis, ov first converts the `gene_id` index of the matrix to `gene_name`.
#
# We can then use `PyDESeq2` to perform a DESeq2 analysis as in R.
#
# Paper: [PyDESeq2: a python package for bulk RNA-seq differential expression analysis](https://www.biorxiv.org/content/10.1101/2022.12.14.520412v1)
#
# Code: https://github.com/owkin/PyDESeq2
#
# Colab_Reproducibility: https://colab.research.google.com/drive/1fZS-v0zdIYkXrEoIAM1X5kPoZVfVvY5h?usp=sharing
# In[1]: | |
import omicverse as ov | |
ov.utils.ov_plot_set() | |
# Note that this dataset has not been processed in any way: it is the raw output of `featureCounts`, and the reads were aligned against the GRCm39 genome.
# In[2]: | |
data=ov.utils.read('https://raw.githubusercontent.com/Starlitnightly/Pyomic/master/sample/counts.txt',index_col=0,header=1) | |
# Strip the directory path and the `.bam` suffix from the column names
data.columns=[i.split('/')[-1].replace('.bam','') for i in data.columns] | |
data.head() | |
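# The column clean-up above can be checked in isolation. The sketch below applies the same
# expression to two made-up column headers; the paths are hypothetical and only mimic the
# full BAM paths that `featureCounts` writes into its header.

```python
# Hypothetical featureCounts column headers (full paths to the BAM files).
raw_columns = [
    "/data/project/bam/1--1.bam",
    "/data/project/bam/4-3.bam",
]


def clean_sample_name(column: str) -> str:
    """Keep only the file name and drop the `.bam` suffix."""
    return column.split("/")[-1].replace(".bam", "")


sample_names = [clean_sample_name(c) for c in raw_columns]
print(sample_names)
```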
# ## ID mapping | |
# | |
# We map `gene_id` to `gene_name` using the `GRCm39` mapping pair file, which can be downloaded with `ov.utils.download_geneid_annotation_pair()`.
# In[ ]: | |
ov.utils.download_geneid_annotation_pair() | |
# In[3]: | |
data=ov.bulk.Matrix_ID_mapping(data,'genesets/pair_GRCm39.tsv') | |
data.head() | |
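# To make the idea of ID mapping concrete, here is a minimal pandas sketch on toy data.
# It is not the implementation of `ov.bulk.Matrix_ID_mapping`: the gene IDs, the gene names
# and the pair-table column names `gene_id` and `symbol` are all assumptions, and unmapped
# genes are simply dropped here, which may differ from what ov does.

```python
import pandas as pd

# Toy count matrix indexed by gene IDs (all values are made up).
counts = pd.DataFrame(
    {"1--1": [10, 0, 5], "4-3": [20, 3, 7]},
    index=["ENSMUSG_A", "ENSMUSG_B", "ENSMUSG_C"],
)

# Toy mapping table; the column names are assumptions for this sketch.
pairs = pd.DataFrame(
    {"gene_id": ["ENSMUSG_A", "ENSMUSG_B"], "symbol": ["GeneA", "GeneB"]}
)

id_to_name = pairs.set_index("gene_id")["symbol"]

# Keep only genes that have a mapping, then swap the index to gene names.
mapped = counts.loc[counts.index.isin(id_to_name.index)].copy()
mapped.index = mapped.index.map(id_to_name)
print(mapped)
```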
# ## Differential expression analysis with ov
#
# Differential expression analysis in ov only requires an expression matrix. To run the DEG analysis, we need to:
#
# - Read the raw counts produced by featureCounts or any other quantification method.
# - Create an ov DEseq object.
# In[4]: | |
dds=ov.bulk.pyDEG(data) | |
# The gene_name mapping above produces some duplicate index entries. We process the duplicated indexes so that only the most highly expressed gene is retained for each name.
# In[5]: | |
dds.drop_duplicates_index() | |
print('... drop_duplicates_index success') | |
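# The de-duplication rule can be sketched in plain pandas. This is an illustration, not the
# code behind `drop_duplicates_index`; in particular, "highest expressed" is taken here to
# mean the highest total count across samples, which is an assumption.

```python
import pandas as pd

# Toy matrix in which the name "GeneA" appears twice after ID mapping.
counts = pd.DataFrame(
    {"1--1": [10, 200, 5], "4-3": [20, 300, 7]},
    index=["GeneA", "GeneA", "GeneB"],
)

# Sort rows by total count (highest first), then keep the first row per gene name.
ranked = counts.assign(_total=counts.sum(axis=1)).sort_values(
    "_total", ascending=False
)
deduplicated = ranked[~ranked.index.duplicated(keep="first")].drop(columns="_total")
print(deduplicated)
```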
# Now we can calculate the differentially expressed genes from the matrix. We need to specify the treatment and control groups.
# In[6]: | |
treatment_groups=['4-3','4-4'] | |
control_groups=['1--1','1--2'] | |
result=dds.deg_analysis(treatment_groups,control_groups,method='DEseq2') | |
# One important point is that low-expression genes are not filtered out when the DEGs are computed, so we filter them here by hand. In future versions I will consider building in the corresponding processing.
# In[7]: | |
print(result.shape) | |
result=result.loc[result['log2(BaseMean)']>1] | |
print(result.shape) | |
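# A toy example of what this filter keeps: `log2(BaseMean) > 1` is the same condition as a
# base mean strictly greater than 2. The `BaseMean` column name and the values below are
# assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up result table with a base-mean column.
toy_result = pd.DataFrame(
    {"BaseMean": [1.5, 2.0, 64.0]},
    index=["GeneA", "GeneB", "GeneC"],
)
toy_result["log2(BaseMean)"] = np.log2(toy_result["BaseMean"])

# Same expression as above: strict inequality, so a base mean of exactly 2 is dropped.
filtered = toy_result.loc[toy_result["log2(BaseMean)"] > 1]
print(filtered)
```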
# We also need to set the fold-change threshold, which is done with the `foldchange_set` method. This function automatically calculates an appropriate threshold based on the log2FC distribution, but you can also enter it manually.
# In[8]: | |
# fc_threshold=-1 means the threshold is calculated automatically
dds.foldchange_set(fc_threshold=-1, | |
pval_threshold=0.05, | |
logp_max=10) | |
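# How a pair of thresholds turns into up/down calls can be sketched as follows. This is a
# hedged illustration rather than the logic of `foldchange_set`: the function name, the
# labels 'up', 'down' and 'normal', and the reading of the fold-change threshold as a
# log2 value are assumptions.

```python
def classify_gene(log2fc, padj, fc_threshold=1.0, pval_threshold=0.05):
    """Label a gene from its log2 fold change and adjusted p-value."""
    if padj < pval_threshold and log2fc > fc_threshold:
        return "up"
    if padj < pval_threshold and log2fc < -fc_threshold:
        return "down"
    return "normal"


# Made-up (log2FC, padj) pairs.
toy_genes = {
    "GeneA": (2.3, 0.001),   # large change, significant
    "GeneB": (-1.8, 0.01),   # large negative change, significant
    "GeneC": (3.0, 0.2),     # large change, not significant
    "GeneD": (0.4, 0.001),   # significant, but change too small
}
labels = {name: classify_gene(fc, p) for name, (fc, p) in toy_genes.items()}
print(labels)
```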
# ## Visualize the DEG result and specific genes | |
# | |
# To visualize the DEG result, we use `plot_volcano`. This function can label either genes of interest or the most differentially expressed genes. The main parameters are:
#
# - title: The title of the volcano plot
# - figsize: The size of the figure
# - plot_genes: The genes you are interested in
# - plot_genes_num: If you have no specific genes of interest, the number of genes to label automatically.
# In[9]: | |
dds.plot_volcano(title='DEG Analysis',figsize=(4,4), | |
plot_genes_num=8,plot_genes_fontsize=12,) | |
# To visualize specific genes, we use the `dds.plot_boxplot` function.
# In[10]: | |
dds.plot_boxplot(genes=['Ckap2','Lef1'],treatment_groups=treatment_groups, | |
control_groups=control_groups,figsize=(2,3),fontsize=12, | |
legend_bbox=(2,0.55)) | |
# In[11]: | |
dds.plot_boxplot(genes=['Ckap2'],treatment_groups=treatment_groups, | |
control_groups=control_groups,figsize=(2,3),fontsize=12, | |
legend_bbox=(2,0.55)) | |
# ## Pathway enrichment analysis with ov
#
# Here we use the `gseapy` package, which includes both GSEA and enrichment analysis. We have optimised the output of the package and added some better-looking plotting functions.
#
# As before, we first need to download the pathway gene sets. We have prepared five gene sets, which you can download automatically with `ov.utils.download_pathway_database()`. You can also download any pathway library you are interested in from Enrichr: https://maayanlab.cloud/Enrichr/#libraries
# In[13]: | |
ov.utils.download_pathway_database() | |
# In[14]: | |
pathway_dict=ov.utils.geneset_prepare('genesets/WikiPathways_2019_Mouse.txt',organism='Mouse') | |
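# For readers who want to inspect a gene set library by hand, here is a minimal parser
# sketch. It assumes a GMT-like layout (term, an often empty description field, then one
# gene per tab-separated field), and it does not reproduce what `geneset_prepare` does with
# the `organism` argument. The pathway and gene names are made up.

```python
import io

# Toy library text in the assumed layout: term <TAB> description <TAB> genes...
toy_library = io.StringIO(
    "Pathway One WP1\t\tGeneA\tGeneB\tGeneC\n"
    "Pathway Two WP2\t\tGeneB\tGeneD\n"
)


def parse_geneset_file(handle):
    """Return a dict mapping each term to its list of genes."""
    pathways = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        term = fields[0]
        # fields[1] is the description column; genes start at the third field.
        genes = [gene for gene in fields[2:] if gene]
        if term:
            pathways[term] = genes
    return pathways


toy_pathways = parse_geneset_file(toy_library)
print(toy_pathways)
```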
# To perform the GSEA analysis, we first need to rank the genes. `dds.ranking2gsea` returns a ranked gene matrix sorted by -log10(padj), with the sign taken from the direction of the fold change:
#
# $Metric=\frac{-\log_{10}(padj)}{\mathrm{sign}(\log_2 FC)}$
# In[15]: | |
rnk=dds.ranking2gsea() | |
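# The ranking metric can be reproduced on toy values with the standard library. This is a
# sketch of the formula only, not of `ranking2gsea`: the gene names and values are made up,
# padj is assumed to be strictly positive, a log2FC of exactly 0 is treated as positive,
# and sorting from the largest to the smallest metric is an assumption about the ordering.

```python
import math


def ranking_metric(padj, log2fc):
    """-log10(padj) divided by the sign of log2FC (sign is +1 or -1)."""
    sign = math.copysign(1.0, log2fc)
    return -math.log10(padj) / sign


# Made-up (padj, log2FC) pairs.
toy_stats = {
    "GeneA": (0.001, 2.0),
    "GeneB": (0.01, -1.5),
    "GeneC": (0.5, 0.3),
}
ranked = sorted(
    ((gene, ranking_metric(padj, fc)) for gene, (padj, fc) in toy_stats.items()),
    key=lambda item: item[1],
    reverse=True,
)
for gene, metric in ranked:
    print(gene, round(metric, 3))
```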
# We use `ov.bulk.pyGSEA` to construct a GSEA object and perform the enrichment.
# In[16]: | |
gsea_obj=ov.bulk.pyGSEA(rnk,pathway_dict) | |
# In[17]: | |
enrich_res=gsea_obj.enrichment() | |
# The results are stored in the `enrich_res` attribute. | |
# In[18]: | |
gsea_obj.enrich_res.head() | |
# To visualize the enrichment, we use `plot_enrichment`, which accepts the following parameters:
# - num: The number of enriched terms to plot. Default is 10. | |
# - node_size: A list of integers defining the size of nodes in the plot. Default is [5,10,15]. | |
# - cax_loc: The location of the colorbar on the plot. Default is 2. | |
# - cax_fontsize: The fontsize of the colorbar label. Default is 12. | |
# - fig_title: The title of the plot. Default is an empty string. | |
# - fig_xlabel: The label of the x-axis. Default is 'Fractions of genes'. | |
# - figsize: The size of the plot. Default is (2,4). | |
# - cmap: The colormap to use for the plot. Default is 'YlGnBu'. | |
# In[19]: | |
gsea_obj.plot_enrichment(num=10,node_size=[10,20,30], | |
cax_fontsize=12, | |
fig_title='Wiki Pathway Enrichment',fig_xlabel='Fractions of genes', | |
figsize=(2,4),cmap='YlGnBu', | |
text_knock=2,text_maxsize=30, | |
cax_loc=[2.5, 0.45, 0.5, 0.02], | |
bbox_to_anchor_used=(-0.25, -13),node_diameter=10,) | |
# Beyond the basic analysis, pyGSEA can also plot the ranked metric and the running enrichment score for a single term.
#
# Terms are selected by their position in `gsea_obj.enrich_res.index`: here `0` is `Complement and Coagulation Cascades WP449` and `1` is `Matrix Metalloproteinases WP441`.
# In[20]: | |
gsea_obj.enrich_res.index[:5] | |
# We can set `gene_set_title` to change the title of the GSEA plot.
# In[22]: | |
fig=gsea_obj.plot_gsea(term_num=1, | |
gene_set_title='Matrix Metalloproteinases', | |
figsize=(3,4), | |
cmap='RdBu_r', | |
title_fontsize=14, | |
title_y=0.95) | |
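# As background for reading the GSEA plot, the running enrichment score can be sketched
# with the classic weighted running-sum statistic. This is an illustration under stated
# assumptions, not the code that `plot_gsea` or `gseapy` runs: the function name, the
# toy ranking and the gene set are hypothetical.

```python
def running_enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """Walk down the ranked list, stepping up on hits and down on misses."""
    in_set = [gene in gene_set for gene in ranked_genes]
    hit_weights = [
        abs(score) ** weight if hit else 0.0 for score, hit in zip(scores, in_set)
    ]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranking without covering it")

    running, curve = 0.0, []
    for hit_weight, hit in zip(hit_weights, in_set):
        running += hit_weight / total_hit if hit else -1.0 / n_miss
        curve.append(running)
    # The enrichment score is the point of maximum deviation from zero.
    return curve, max(curve, key=abs)


# Made-up ranking, sorted from the highest to the lowest metric.
toy_ranked_genes = ["GeneA", "GeneB", "GeneC", "GeneD", "GeneE"]
toy_scores = [3.0, 2.0, 0.5, -1.0, -2.5]
curve, es = running_enrichment_score(toy_ranked_genes, toy_scores, {"GeneA", "GeneB"})
print([round(value, 3) for value in curve], round(es, 3))
```

# A gene set concentrated at the top of the ranking gives a positive score, one
# concentrated at the bottom gives a negative score, and the curve returns to
# approximately zero at the end of the list.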
# In[ ]: | |