Commit c34ead6 by Christina Theodoris
1 parent: d468697

Add further explanation regarding input file format for transcriptome tokenizer

examples/tokenizing_scRNAseq_data.ipynb CHANGED
@@ -17,7 +17,7 @@
   "source": [
   "#### Input data is a directory with .loom files containing raw counts from single cell RNAseq data, including all genes detected in the transcriptome without feature selection. \n",
   "\n",
- "#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens.\n",
+ "#### Genes should be labeled with Ensembl IDs (row attribute \"ensembl_id\"), which provide a unique identifer for conversion to tokens. Cells should be labeled with the total read count in the cell (column attribute \"n_counts\") to be used for normalization.\n",
   "\n",
   "#### No cell metadata is required, but custom cell attributes may be passed onto the tokenized dataset by providing a dictionary of custom attributes to be added, which is formatted as loom_col_attr_name : desired_dataset_col_attr_name. For example, if the original .loom dataset has column attributes \"cell_type\" and \"organ_major\" and one would like to retain these attributes as labels in the tokenized dataset with the new names \"cell_type\" and \"organ\", respectively, the following custom attribute dictionary should be provided: {\"cell_type\": \"cell_type\", \"organ_major\": \"organ\"}. \n",
   "\n",
geneformer/tokenizer.py CHANGED
@@ -1,6 +1,13 @@
   """
   Geneformer tokenizer.
 
+ Input data:
+ Required format: raw counts scRNAseq data without feature selection as .loom file
+ Required row (gene) attribute: "ensembl_id"; Ensembl ID for each gene
+ Required col (cell) attribute: "n_counts"; total read counts in that cell
+ Optional col (cell) attribute: "filter_pass"; binary indicator of whether cell should be tokenized based on user-defined filtering criteria
+ Optional col (cell) attributes: any other cell metadata can be passed on to the tokenized dataset as a custom attribute dictionary as shown below
+
   Usage:
   from geneformer import TranscriptomeTokenizer
   tk = TranscriptomeTokenizer({"cell_type": "cell_type", "organ_major": "organ_major"}, nproc=4)
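
Reviewer note: the attributes listed in the new docstring can be derived from a raw count matrix before the .loom file is written. The sketch below is illustrative only and is not part of Geneformer. The helper names `build_loom_attributes` and `select_custom_attributes`, the `min_counts` threshold used as the filtering criterion, and the toy data are all assumptions made for this example. It uses plain Python lists (genes as rows, cells as columns) so that it runs without extra dependencies.

```python
def build_loom_attributes(counts, ensembl_ids, min_counts=0):
    """Build the row/column attributes described in the docstring.

    counts: raw count matrix as a list of rows (genes x cells).
    ensembl_ids: one Ensembl gene ID per row.
    min_counts: hypothetical user-defined filter; cells below it get filter_pass = 0.
    """
    if len(counts) != len(ensembl_ids):
        raise ValueError("Need exactly one Ensembl ID per gene (row).")
    n_cells = len(counts[0]) if counts else 0
    if any(len(row) != n_cells for row in counts):
        raise ValueError("All rows must have the same number of cells.")

    # "n_counts": total read count per cell (sum over genes for each column).
    n_counts = [sum(row[j] for row in counts) for j in range(n_cells)]
    # "filter_pass": binary indicator of whether the cell should be tokenized.
    filter_pass = [1 if total >= min_counts else 0 for total in n_counts]

    row_attrs = {"ensembl_id": list(ensembl_ids)}
    col_attrs = {"n_counts": n_counts, "filter_pass": filter_pass}
    return row_attrs, col_attrs


def select_custom_attributes(col_attrs, custom_attr_name_dict):
    """Pick the requested cell attributes and rename them (loom name -> dataset name)."""
    return {new: col_attrs[old] for old, new in custom_attr_name_dict.items()}


# Toy data: 3 genes x 3 cells, with example Ensembl gene IDs.
counts = [
    [5, 0, 1],
    [3, 2, 0],
    [0, 7, 0],
]
ensembl_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000139618"]

row_attrs, col_attrs = build_loom_attributes(counts, ensembl_ids, min_counts=5)

# Optional cell metadata that a custom attribute dictionary could pass on.
col_attrs["cell_type"] = ["T cell", "B cell", "hepatocyte"]
col_attrs["organ_major"] = ["blood", "blood", "liver"]
custom = select_custom_attributes(
    col_attrs, {"cell_type": "cell_type", "organ_major": "organ"}
)

print(col_attrs["n_counts"], col_attrs["filter_pass"])
print(custom)
```

If loompy is available, these dictionaries could likely be written out with `loompy.create` after converting the lists and the matrix to NumPy arrays. That step is left out here to keep the sketch dependency-free.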