#!/usr/bin/env python
# coding: utf-8

# # TCGA database preprocess
# 
# We often download patient survival data from the TCGA database for analysis in order to verify the importance of genes in cancer. However, the pre-processing of the TCGA database is often a headache. Here, we have introduced the TCGA module in ov, a way to quickly process the file formats we download from the TCGA database. We need to prepare 3 files as input:
# 
# - gdc_sample_sheet (`.tsv`): The `Sample Sheet` button of TCGA, and we can get `tsv` file from it
# - gdc_download_files (`folder`): The `Download/Cart` button of TCGA, and we get `tar.gz` included all file you selected/
# - clinical_cart (`folder`): The `Clinical` button of TCGA, and we can get `tar.gz` included all clinical of your files

# In[1]:


import omicverse as ov
import scanpy as sc
ov.plot_set()


# ## TCGA counts read
# 
# Here, we use `ov.bulk.TCGA` to perform the `gdc_sample_sheet`, `gdc_download_files` and `clinical_cart` you download before. The raw count, fpkm and tpm matrix will be stored in anndata object

# In[2]:


get_ipython().run_cell_magic('time', '', "gdc_sample_sheep='data/TCGA_OV/gdc_sample_sheet.2024-07-05.tsv'\ngdc_download_files='data/TCGA_OV/gdc_download_20240705_180129.081531'\nclinical_cart='data/TCGA_OV/clinical.cart.2024-07-05'\naml_tcga=ov.bulk.pyTCGA(gdc_sample_sheep,gdc_download_files,clinical_cart)\naml_tcga.adata_init()\n")


# We can save the anndata object for the next use

# In[3]:


aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad',compression='gzip')


# Note: Each time we read the anndata file, we need to initialize the TCGA object using three paths so that the subsequent TCGA functions such as survival analysis can be used properly
# 
# If you wish to create your own TCGA data, we have provided [sample data](https://figshare.com/ndownloader/files/47461946) here for download:
# 
# TCGA OV: https://figshare.com/ndownloader/files/47461946

# In[2]:


gdc_sample_sheep='data/TCGA_OV/gdc_sample_sheet.2024-07-05.tsv'
gdc_download_files='data/TCGA_OV/gdc_download_20240705_180129.081531'
clinical_cart='data/TCGA_OV/clinical.cart.2024-07-05'
aml_tcga=ov.bulk.pyTCGA(gdc_sample_sheep,gdc_download_files,clinical_cart)
aml_tcga.adata_read('data/TCGA_OV/ov_tcga_raw.h5ad')


# ## Meta init
# As the TCGA reads the gene_id, we need to convert it to gene_name as well as adding basic information about the patient. Therefore we need to initialise the patient's meta information.

# In[3]:


aml_tcga.adata_meta_init()


# ## Survial init
# We set up the path for Clinical earlier, but in fact we did not import the patient information in the previous process, we only initially determined the id of the patient's TCGA, so we attracted to initialize the clinical information

# In[4]:


aml_tcga.survial_init()
aml_tcga.adata


# To visualize the gene you interested, we can use `survival_analysis` to finish it. 

# In[5]:


aml_tcga.survival_analysis('MYC',layer='deseq_normalize',plot=True)


# If you want to calculate the survival of all genes, you can also use the `survial_analysis_all` to finish it. It may calculate a lot of times.

# In[ ]:


aml_tcga.survial_analysis_all()
aml_tcga.adata


# Don't forget to save your result.

# In[ ]:


aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_survial_all.h5ad',compression='gzip')