anndata_tokenizer
Thank you for sharing your contribution in this pull request!
- Regarding Dataset.from_generator: when testing it with multiple runs to get a better estimate (since run times can vary at random), it unfortunately does seem to be slower than Dataset.from_dict for larger datasets. This is perhaps to be expected given the for loop in the generation step. Of note, Datasets caches the generator output, so repeated runs may give the misleading impression that it is faster, while repeated generation is not how users would realistically access this function. (A sketch of the timing comparison follows the table below.)
| Dataset size | from_dict | from_generator |
| --- | --- | --- |
| Small (~5K cells) | 0.071 | 0.072 |
| Medium (~200K cells) | 3.000 | 20.635 |
| Large (~1M cells) | 15.761 | 100.186 |
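For reference, a minimal sketch of how such a timing comparison can be run (the token lists here are placeholders, not the actual tokenized data):

```python
import time
from datasets import Dataset

# placeholder token lists standing in for real tokenized cells
tokenized_cells = [[1, 2, 3]] * 5_000

start = time.perf_counter()
ds_dict = Dataset.from_dict({"input_ids": tokenized_cells})
print(f"from_dict: {time.perf_counter() - start:.3f}")

def gen():
    for cell in tokenized_cells:
        yield {"input_ids": cell}

# Note: Datasets caches generator output by fingerprint, so only a fresh
# first run reflects the true generation cost; later runs hit the cache.
start = time.perf_counter()
ds_gen = Dataset.from_generator(gen)
print(f"from_generator: {time.perf_counter() - start:.3f}")
```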
Upon searching the Hugging Face Datasets issues though, the error in discussion #80 appears to be a known problem that they are working on resolving, so hopefully it will not be an issue in future versions. I would like to explore some other options, but I am not encountering the error when testing datasets with ~1M cells. Would you be able to share one of the problematic datasets with me so that I can reproduce the error and work on resolving it?
- For the anndata tokenizer, I tested it with an .h5ad dataset of ~1M cells and encountered a few errors so far. The first error was not accounting for the possibility of there being no metadata dictionary, so I added if statements to handle that case, similarly to the loom version. The next error is below. I believe it stems from not handling the case where filter_pass exists: on line 203 below, adata_filter.X has been filtered by both cells and genes, while adata.X has not been filtered by cells, so the shapes cannot broadcast.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 3
1 from geneformer import TranscriptomeTokenizer
2 tk = TranscriptomeTokenizer(nproc=16)
----> 3 tk.tokenize_data("/path/to/h5ad_1Mcells",
4 "/path/to/h5ad_1Mcells/output",
5 "tokenized_h5ad_1Mcells",
6 file_format="h5ad")
File ~/Geneformer/geneformer/tokenizer.py:117, in TranscriptomeTokenizer.tokenize_data(self, data_directory, output_directory, output_prefix, file_format)
97 def tokenize_data(
98 self,
99 data_directory: Path | str,
(...)
102 file_format: Literal["loom", "h5ad"] = "loom",
103 ):
104 """
105 Tokenize .loom files in loom_data_directory and save as tokenized .dataset in output_directory.
106 Parameters
(...)
115 Format of input files. Can be "loom" or "h5ad".
116 """
--> 117 tokenized_cells, cell_metadata = self.tokenize_files(
118 Path(data_directory), file_format
119 )
120 tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata)
122 output_path = (Path(output_directory) / output_prefix).with_suffix(".dataset")
File ~/Geneformer/geneformer/tokenizer.py:146, in TranscriptomeTokenizer.tokenize_files(self, data_directory, file_format)
144 file_found = 1
145 print(f"Tokenizing {file_path}")
--> 146 file_tokenized_cells, file_cell_metadata = tokenize_file_fn(file_path)
147 tokenized_cells += file_tokenized_cells
148 if self.custom_attr_name_dict is not None:
File ~/Geneformer/geneformer/tokenizer.py:203, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path)
198 tokenized_cells = []
199 adata_filter = adata[
200 filter_pass_loc, coding_miRNA_loc # filter cells and genes
201 ]
--> 203 X_norm = (adata_filter.X / adata.X.sum(1) * 10_000 / norm_factor_vector).tocsr()
205 tokenized_cells += [
206 tokenize_cell(X_norm[i, ...].A.flatten(), coding_miRNA_tokens)
207 for i in range(X_norm.shape[0])
208 ]
210 # add custom attributes for subview to dict
File ~/miniconda3/lib/python3.10/site-packages/scipy/sparse/_base.py:686, in spmatrix.__truediv__(self, other)
685 def __truediv__(self, other):
--> 686 return self._divide(other, true_divide=True)
File ~/miniconda3/lib/python3.10/site-packages/scipy/sparse/_base.py:665, in spmatrix._divide(self, other, true_divide, rdivide)
663 if not rdivide:
664 if true_divide:
--> 665 return np.true_divide(self.todense(), other)
666 else:
667 return np.divide(self.todense(), other)
ValueError: operands could not be broadcast together with shapes (986122,24124) (1002756,1)
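A minimal illustration of this mismatch, with toy dimensions standing in for the 1M-cell matrix (dividing a cell-and-gene-filtered matrix by per-cell sums of the unfiltered matrix fails on the row axis):

```python
import numpy as np
import scipy.sparse as sp

X_all = sp.random(10, 6, density=0.5, format="csr")  # 10 cells x 6 genes
filter_pass_loc, coding_miRNA_loc = np.arange(8), np.arange(4)

X_filt = X_all[filter_pass_loc][:, coding_miRNA_loc]  # shape (8, 4)
X_filt / X_all.sum(axis=1)  # (8, 4) vs (10, 1) -> ValueError: cannot broadcast
```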
However, when I add filtering for the cells to get adata_cell_filter (see below), the operation raises an AttributeError that the matrix object has no attribute "tocsr".
File ~/Geneformer/geneformer/tokenizer.py:209, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path)
202 adata_filter = adata[
203 filter_pass_loc, coding_miRNA_loc # filter cells and genes
204 ]
205 adata_cell_filter = adata[
206 filter_pass_loc, : # filter cells only
207 ]
--> 209 X_norm = (adata_filter.X / adata_cell_filter.X.sum(1) * 10_000 / norm_factor_vector).tocsr()
211 tokenized_cells += [
212 tokenize_cell(X_norm[i, ...].A.flatten(), coding_miRNA_tokens)
213 for i in range(X_norm.shape[0])
214 ]
216 # add custom attributes for subview to dict
AttributeError: 'matrix' object has no attribute 'tocsr'
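My reading of this AttributeError (a sketch, not verified against the code): true-dividing a scipy sparse matrix by a dense array densifies the result into numpy.matrix, which has no .tocsr() method, so casting back with scipy.sparse.csr_matrix would avoid calling it on the dense type.

```python
import numpy as np
import scipy.sparse as sp

X = sp.csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
row_sums = X.sum(axis=1)       # numpy.matrix, shape (2, 1)
dense = X / row_sums           # densified to numpy.matrix -- no .tocsr()
X_norm = sp.csr_matrix(dense)  # explicit cast back to sparse CSR
```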
It would be great if you could take a look into resolving these for the anndata version. I'm not sure if you encountered something similar.
Additionally, I noticed that the anndata version does not perform this operation by scanning through the file the way the loom version does; it requires >500 GB of RAM for large datasets of ~1M cells. If you know of an anndata function that would allow scanning through the file in chunks, similar to the loom version, that would be great for avoiding memory constraints.
Thank you for your collaboration on this!
> The first error was not accounting for the possibility of no metadata dictionary, so I added if statements to handle that case similarly to the loom version
What arguments did you pass to the TranscriptomeTokenizer that made those if statements necessary? I ran it with:
tk = TranscriptomeTokenizer({})
# and
tk = TranscriptomeTokenizer({"cell_type": "cell_type"})
and both worked fine.
I also did not run into the second issue you mentioned above. For me, changing line 203 in tokenizer.py to:
X_norm = (adata_filter.X / adata[filter_pass_loc].X.sum(1) * 10_000 / norm_factor_vector).tocsr()
made it work.
My relevant package versions are:
anndata==0.9.1
arrow==1.2.3
datasets==2.13.1
numpy==1.24.3
pyarrow==12.0.1
scipy==1.11.0
torch==2.0.1
transformers==4.30.2
Which versions are you using?
@ctheodoris
I addressed the issues you mentioned above by casting the matrix to a CSR matrix, scanning through the anndata object (in "backed" mode) instead of loading it into memory, and adding a parameter to decide whether to use from_dict or from_generator.
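A minimal sketch of the backed-mode chunked scanning pattern (variable names follow the traceback further down; the file path and the filter_pass attribute are assumptions):

```python
import anndata as ad
import numpy as np

# open in backed mode so the expression matrix stays on disk
adata = ad.read_h5ad("/path/to/data.h5ad", backed="r")
filter_pass_loc = np.where(adata.obs["filter_pass"] == 1)[0]

chunk_size = 512
for i in range(0, len(filter_pass_loc), chunk_size):
    idx = filter_pass_loc[i : i + chunk_size]
    X = adata[idx].X  # only this chunk of cells is read into memory
    # ...normalize and tokenize the chunk here...
```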
Thank you so much! That’s wonderful. I’m currently away with limited internet access so will test the updated version when I return and merge it if all looks good. Thank you for your key contribution to the code base!
Thank you again for your collaboration on this. I have returned and am testing out the new version. A couple of remaining issues:
- The default for custom_attr_name_dict is None, so if it isn't specified (see how I ran the code in the error trace below), the anndata tokenizer errors. There are if statements in the loom version that account for this; I added them here to resolve it:
Changes to add:
Lines 168-171:
if self.custom_attr_name_dict is not None:
    file_cell_metadata = {
        attr_key: [] for attr_key in self.custom_attr_name_dict.keys()
    }
Lines 218-223:
# add custom attributes for subview to dict
if self.custom_attr_name_dict is not None:
    for k in file_cell_metadata.keys():
        file_cell_metadata[k] += adata[idx].obs[k].tolist()
else:
    file_cell_metadata = None
- I encountered the error below when calculating X_norm. Are you able to resolve this? (See how I ran the code in the error trace below; the anndata file has a filter_pass cell attribute.)
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[1], line 3
1 from geneformer import TranscriptomeTokenizer
2 tk = TranscriptomeTokenizer(nproc=16)
----> 3 tk.tokenize_data("/path/to/h5ad_1Mcells",
4 "/path/to/h5ad_1Mcells/output",
5 "tokenized_h5ad_1Mcells",
6 file_format="h5ad")
File ~/Geneformer/geneformer/tokenizer.py:128, in TranscriptomeTokenizer.tokenize_data(self, data_directory, output_directory, output_prefix, file_format, use_generator)
105 def tokenize_data(
106 self,
107 data_directory: Path | str,
(...)
111 use_generator: bool = False,
112 ):
113 """
114 Tokenize .loom files in loom_data_directory and save as tokenized .dataset in output_directory.
115 Parameters
(...)
126 Whether to use generator or dict for tokenization.
127 """
--> 128 tokenized_cells, cell_metadata = self.tokenize_files(
129 Path(data_directory), file_format
130 )
131 tokenized_dataset = self.create_dataset(tokenized_cells, cell_metadata, use_generator=use_generator)
133 output_path = (Path(output_directory) / output_prefix).with_suffix(".dataset")
File ~/Geneformer/geneformer/tokenizer.py:152, in TranscriptomeTokenizer.tokenize_files(self, data_directory, file_format)
150 file_found = 1
151 print(f"Tokenizing {file_path}")
--> 152 file_tokenized_cells, file_cell_metadata = tokenize_file_fn(file_path)
153 tokenized_cells += file_tokenized_cells
154 if self.custom_attr_name_dict is not None:
File ~/Geneformer/geneformer/tokenizer.py:210, in TranscriptomeTokenizer.tokenize_anndata(self, adata_file_path, target_sum, chunk_size)
207 idx = filter_pass_loc[i:i+chunk_size]
208 X = adata[idx].X
--> 210 X_norm = (X / X[:, coding_miRNA_loc].sum(axis=1) * target_sum / norm_factor_vector)
211 X_norm = sp.csr_matrix(X_norm)
213 tokenized_cells += [
214 rank_genes(X_norm[i].data, coding_miRNA_tokens[X_norm[i].indices])
215 for i in range(X_norm.shape[0])
216 ]
ValueError: operands could not be broadcast together with shapes (512,63561) (24124,)
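For context, the shapes suggest X still spans all 63,561 genes while norm_factor_vector covers only the 24,124 coding/miRNA genes. One plausible reconciliation (a sketch using the traceback's variable names, not necessarily the fix ultimately adopted) is to subset the genes before normalizing:

```python
# names (X, coding_miRNA_loc, target_sum, norm_factor_vector, sp) as in the
# traceback above; subset genes first so both operands span 24124 columns
X_sub = X[:, coding_miRNA_loc]  # (512, 24124)
X_norm = X_sub / X_sub.sum(axis=1) * target_sum / norm_factor_vector
X_norm = sp.csr_matrix(X_norm)
```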
@ctheodoris I addressed your issues, let me know if they're fixed
Thank you so much for addressing these! Indeed, the code is able to run now. However, when I check the results, they are unfortunately not the same between the anndata and loom versions. I used scanpy to convert a .loom file to an .h5ad file so that the inputs would be identical, and then ran each through the transcriptome tokenizer, either specifying the anndata version or not specifying a file type (thereby using the default loom one). When I use the following to create a checksum column in the datasets and then build a set from that column, the two sets are not the same: each set has 986122 entries (cells), but their union has 986320, indicating that some tokenizations differ. I am happy to send you these input/output datasets if you email me so we can troubleshoot the reasons behind this. Did you check the outputs previously and find they were the same?
def create_checksum(example):
    example["checksum"] = hash(tuple(example["input_ids"]))
    return example

test_dataset = test_dataset.map(create_checksum, num_proc=16)
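For reference, the set comparison described above amounts to something like the following (dataset variable names are assumed):

```python
loom_checksums = set(loom_dataset["checksum"])
h5ad_checksums = set(h5ad_dataset["checksum"])

# identical outputs -> identical sets; a larger union means some cells'
# token rankings differ between the two tokenizers
print(len(loom_checksums), len(h5ad_checksums), len(loom_checksums | h5ad_checksums))
```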
Thank you again for your collaboration on this!
Looks great, checksums match - thank you so much for all your collaboration on this. It is a valuable contribution that will be helpful to many researchers.