{
"cells": [
{
"cell_type": "markdown",
"id": "charged-worcester",
"metadata": {},
"source": [
"# Obtain non-zero median expression value of each gene across Genecorpus-30M"
]
},
{
"cell_type": "markdown",
"id": "28e87f2a-a33e-4fe3-81af-ad4cd62fcc1b",
"metadata": {},
"source": [
"#### Upon request, we are providing the code we used to obtain the non-zero median expression value of each gene across the broad range of cell types represented in Genecorpus-30M. We use this value as a normalization factor to prioritize genes that uniquely distinguish cell state.\n",
"\n",
"#### Please read the important information below before using this code.\n",
"\n",
"#### If using Geneformer, to ensure consistency of the normalization factor used for each gene for all future datasets, **users should use the Geneformer transcriptome tokenizer to tokenize their datasets and should not re-calculate this normalization factor for their individual dataset**. This code for re-calculating the normalization factor should only be used by users who are pretraining a new model from scratch with a new pretraining corpus other than Genecorpus-30M.\n",
"\n",
"#### It is critical that this calculation is performed on a large-scale pretraining corpus that has tens of millions of cells from a broad range of human tissues. **The richness of variable cell states in the pretraining corpus is what allows this normalization factor to accomplish the goal of prioritizing genes that uniquely distinguish cell states.** This normalization factor for each gene is calculated once from the large-scale pretraining corpus and is used for all future datasets presented to the model.\n",
"\n",
"#### Of note, as discussed in the Methods, we only included droplet-based sequencing platforms in the pretraining corpus to ensure expression value unit comparability for the calculation of this normalization factor. Users wishing to pretrain a new model from scratch with a new pretraining corpus should choose either droplet-based or plate-based platforms for calculating this normalization factor, or should be aware that including both platforms may cause unintended effects on the results. Once the normalization factor is calculated, however, data from any platform can be used with the model because the expression value units will be consistent within each individual cell.\n",
"\n",
"#### Please see the Methods in the manuscript for a description of the procedure enacted by this code, an excerpt of which is below for convenience:\n",
"\n",
"#### \"To accomplish this, we first calculated the non-zero median value of expression of each detected gene across all cells passing quality filtering from the entire Genecorpus-30M. We aggregated the transcript count distribution for each gene in a memory-efficient manner by scanning through chunks of .loom data using loompy, normalizing the gene transcript counts in each cell by the total transcript count of that cell to account for varying sequencing depth and updating the normalized count distribution of the gene within the t-digest data structure developed for accurate online accumulation of rank-based statistics. We then normalized the genes in each single-cell transcriptome by the non-zero median value of expression of that gene across Genecorpus-30M and ordered the genes by the rank of their normalized expression in that specific cell. Of note, we opted to use the non-zero median value of expression rather than include zeros in the distribution so as not to weight the value by tissue representation within Genecorpus-30M, assuming that a representative range of transcript values would be observed within the cells in which each gene was detected. This normalization factor for each gene is calculated once from the pretraining corpus and is used for all future datasets presented to the model. The provided tokenizer code includes this normalization procedure and should be used for tokenizing new datasets presented to Geneformer to ensure consistency of the normalization factor used for each gene.\""
]
},
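{
"cell_type": "markdown",
"id": "toy-nonzero-median-sketch",
"metadata": {},
"source": [
"#### The procedure quoted above can be illustrated with a minimal sketch on toy data. The gene names, counts, and variable names below are hypothetical, exact medians from the Python standard library stand in for t-digests, and this is not the code that was run on Genecorpus-30M.\n",
"\n",
"```python\n",
"from statistics import median\n",
"\n",
"# Hypothetical raw counts: gene -> counts in 4 cells\n",
"counts = {\n",
"    'geneA': [10, 0, 20, 40],\n",
"    'geneB': [1, 2, 0, 4],\n",
"    'geneC': [0, 0, 5, 0],\n",
"}\n",
"n_cells = 4\n",
"\n",
"# Total transcript count per cell, used to account for sequencing depth\n",
"totals = [sum(c[j] for c in counts.values()) for j in range(n_cells)]\n",
"\n",
"# Normalize each count by its cell total and scale by 10,000\n",
"norm = {g: [c[j] / totals[j] * 10_000 for j in range(n_cells)] for g, c in counts.items()}\n",
"\n",
"# Non-zero median per gene: zeros are excluded from the distribution\n",
"gene_medians = {g: median(v for v in vals if v > 0) for g, vals in norm.items()}\n",
"\n",
"# Within one cell, divide each detected gene by its median and rank in descending order\n",
"cell = 2\n",
"scaled = {g: norm[g][cell] / gene_medians[g] for g in counts if norm[g][cell] > 0}\n",
"ranked_genes = sorted(scaled, key=scaled.get, reverse=True)\n",
"print(ranked_genes)\n",
"```\n",
"\n",
"#### In this toy example geneC has a lower depth-normalized value than geneA in the chosen cell, yet it ranks first after division by the non-zero medians, because geneA is highly expressed in every cell where it is detected."
]
},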
{
"cell_type": "code",
"execution_count": 1,
"id": "textile-destruction",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import loompy as lp\n",
"import pandas as pd\n",
"import crick\n",
"import pickle\n",
"import math\n",
"from tqdm.notebook import tqdm"
]
},
{
"cell_type": "markdown",
"id": "4af8cfef-05f2-47e0-b8d2-71ca025059c7",
"metadata": {
"tags": []
},
"source": [
"### The following code is an example of how the nonzero median expression values are obtained for a single input file. This calculation should be run as a script to be parallelized for all dataset files."
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "physical-intro",
"metadata": {},
"outputs": [],
"source": [
"input_file = \"study1.loom\"\n",
"current_database = \"database1\"\n",
"\n",
"rootdir = f\"/path/to/{current_database}/data/\"\n",
"output_file = input_file.replace(\".loom\", \".gene_median_digest_dict.pickle\")\n",
"outdir = rootdir.replace(\"/data/\", \"/tdigest/\")\n",
"\n",
"with lp.connect(f\"{rootdir}{input_file}\") as data:\n",
"    # define coordinates of protein-coding or miRNA genes\n",
"    coding_miRNA_loc = np.where((data.ra.gene_type == \"protein_coding\") | (data.ra.gene_type == \"miRNA\"))[0]\n",
"    coding_miRNA_genes = data.ra[\"ensembl_id\"][coding_miRNA_loc]\n",
"\n",
"    # initiate tdigests (one per selected gene)\n",
"    median_digests = [crick.tdigest.TDigest() for _ in range(len(coding_miRNA_loc))]\n",
"\n",
"    # initiate progress meters\n",
"    progress = tqdm(total=len(coding_miRNA_loc))\n",
"    last_view_row = 0\n",
"    progress.update(0)\n",
"\n",
"    for (ix, selection, view) in data.scan(items=coding_miRNA_loc, axis=0):\n",
"        # define coordinates of cells passing filter\n",
"        filter_passed_loc = np.where(view.ca.filter_pass == 1)[0]\n",
"        subview = view.view[:, filter_passed_loc]\n",
"        # normalize by total counts per cell and multiply by 10,000 to allocate bits to precision\n",
"        subview_norm_array = subview[:,:]/subview.ca.n_counts*10_000\n",
"        # if integer, convert to float to prevent error with filling with nan\n",
"        if np.issubdtype(subview_norm_array.dtype, np.integer):\n",
"            subview_norm_array = subview_norm_array.astype(np.float32)\n",
"        # mask zeroes from distribution tdigest by filling with nan\n",
"        nonzero_data = np.ma.masked_equal(subview_norm_array, 0.0).filled(np.nan)\n",
"        # update the tdigest of each gene in this view, offset by rows already processed\n",
"        for i in range(nonzero_data.shape[0]):\n",
"            median_digests[i+last_view_row].update(nonzero_data[i,:])\n",
"        # update progress meters\n",
"        progress.update(view.shape[0])\n",
"        last_view_row = last_view_row + view.shape[0]\n",
"\n",
"median_digest_dict = dict(zip(coding_miRNA_genes, median_digests))\n",
"# create the output directory if it does not exist yet\n",
"os.makedirs(outdir, exist_ok=True)\n",
"with open(f\"{outdir}{output_file}\", \"wb\") as fp:\n",
"    pickle.dump(median_digest_dict, fp)"
]
},
{
"cell_type": "markdown",
"id": "190a3754-aafa-4ccf-ba97-951c94ea3030",
"metadata": {
"tags": []
},
"source": [
"### After the above code is run as a script in parallel for all datasets to obtain the nonzero median tdigests for their contained genes, the following code can be run to merge the tdigests across all datasets."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "distributed-riding",
"metadata": {},
"outputs": [],
"source": [
"# merge new tdigests into total tdigest dict\n",
"def merge_digest(dict_key_ensembl_id, dict_value_tdigest, new_tdigest_dict):\n",
"    new_gene_tdigest = new_tdigest_dict.get(dict_key_ensembl_id)\n",
"    # merge in place only if the gene is present in the new dict\n",
"    if new_gene_tdigest is not None:\n",
"        dict_value_tdigest.merge(new_gene_tdigest)\n",
"    return dict_value_tdigest"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "distinct-library",
"metadata": {},
"outputs": [],
"source": [
"# use tdigest1.merge(tdigest2) to merge tdigest1, tdigest2, ...tdigestn\n",
"# then, extract median by tdigest1.quantile(0.5)\n",
"\n",
"databases = [\"database1\", \"database2\", \"...databaseN\"]\n",
"\n",
"# obtain gene list\n",
"gene_info = pd.read_csv(\"/path/to/gene_info_table.csv\", index_col=0)\n",
"func_gene_list = [i for i in gene_info[(gene_info[\"gene_type\"] == \"protein_coding\") | (gene_info[\"gene_type\"] == \"miRNA\")][\"ensembl_id\"]]\n",
"\n",
"# initiate tdigests\n",
"median_digests = [crick.tdigest.TDigest() for _ in range(len(func_gene_list))]\n",
"total_tdigest_dict = dict(zip(func_gene_list, median_digests))\n",
"\n",
"# merge tdigests\n",
"for current_database in databases:\n",
"    rootdir = f\"/path/to/{current_database}/tdigest/\"\n",
"\n",
"    for subdir, dirs, files in os.walk(rootdir):\n",
"        for file in files:\n",
"            if file.endswith(\".gene_median_digest_dict.pickle\"):\n",
"                # join with subdir so that files in nested directories resolve correctly\n",
"                with open(os.path.join(subdir, file), \"rb\") as fp:\n",
"                    tdigest_dict = pickle.load(fp)\n",
"                total_tdigest_dict = {k: merge_digest(k,v,tdigest_dict) for k, v in total_tdigest_dict.items()}\n",
"\n",
"# save dict of merged tdigests\n",
"with open(\"/path/to/total_gene_tdigest_dict.pickle\", \"wb\") as fp:\n",
"    pickle.dump(total_tdigest_dict, fp)\n",
"\n",
"# extract medians and save dict (genes with an empty tdigest yield nan)\n",
"total_median_dict = {k: v.quantile(0.5) for k, v in total_tdigest_dict.items()}\n",
"with open(\"/path/to/total_gene_median_dict.pickle\", \"wb\") as fp:\n",
"    pickle.dump(total_median_dict, fp)\n",
"\n",
"# save dict of only detected genes' medians\n",
"detected_median_dict = {k: v for k, v in total_median_dict.items() if not math.isnan(v)}\n",
"with open(\"/path/to/detected_gene_median_dict.pickle\", \"wb\") as fp:\n",
"    pickle.dump(detected_median_dict, fp)"
]
},
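{
"cell_type": "markdown",
"id": "toy-tdigest-merge-sketch",
"metadata": {},
"source": [
"#### The merge step above combines the per-file distributions before the median is taken, rather than averaging per-file medians. A minimal sketch of why this matters, using hypothetical values and exact medians from the Python standard library in place of t-digests:\n",
"\n",
"```python\n",
"from statistics import mean, median\n",
"\n",
"# Hypothetical non-zero normalized expression values of one gene in two files\n",
"file_a = [1.0, 2.0, 3.0]\n",
"file_b = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]\n",
"\n",
"# Averaging per-file medians weights both files equally regardless of cell count\n",
"mean_of_medians = mean([median(file_a), median(file_b)])\n",
"\n",
"# Pooling all values first is the quantity that merged t-digests approximate\n",
"pooled_median = median(file_a + file_b)\n",
"\n",
"print(mean_of_medians, pooled_median)\n",
"```\n",
"\n",
"#### The two results differ because the larger file contributes more cells to the pooled distribution, which is why the per-file t-digests are saved and merged instead of the per-file medians."
]
},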
{
"cell_type": "markdown",
"id": "e8e17ad6-79ac-4f34-aa0c-1eaa1bace2e5",
"metadata": {
"tags": []
},
"source": [
"### The code below displays some characteristics of the genes detected in the pretraining corpus."
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "decent-switzerland",
"metadata": {},
"outputs": [],
"source": [
"gene_detection_counts_dict = {k: v.size() for k, v in total_tdigest_dict.items()}"
]
},
{
"cell_type": "code",
"execution_count": 44,
"id": "polished-innocent",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"