tokenizer-uncropped-input_ids
#275 by jamieb-nvs - opened
Added an option to the create_dataset function in tokenizer.py to keep all gene tokens for a cell in a new feature, 'input_ids_uncropped', and to store the total number of genes in the cell before truncation/cropping in 'length_uncropped'. Both are produced in addition to the output of the existing truncation code.
This allows analysis of uncropped gene token ranks, which can be useful for understanding the coverage of genes across the cells in a dataset, for example in how many cells a gene's rank is <= 2048 compared to > 2048.
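As a rough illustration of that kind of analysis, here is a minimal sketch in plain Python. It assumes each cell's 'input_ids_uncropped' is a list of gene tokens ordered by rank, with each token appearing at most once per cell. The function name `rank_coverage` and the `max_len` parameter are hypothetical and not part of tokenizer.py.

```python
from collections import Counter


def rank_coverage(cells, max_len=2048):
    """Count, per gene token, the cells where its rank is within or beyond max_len.

    `cells` is an iterable of uncropped token lists, one per cell, assumed to be
    ordered by rank (hypothetical helper, not part of tokenizer.py).
    """
    within, beyond = Counter(), Counter()
    for tokens in cells:
        for rank, token in enumerate(tokens, start=1):
            if rank <= max_len:
                within[token] += 1
            else:
                beyond[token] += 1
    return within, beyond


# Toy data with max_len=2 so the split is easy to see.
cells = [[5, 7, 9], [7, 5], [9, 7, 5, 3]]
within, beyond = rank_coverage(cells, max_len=2)
print(within[5], beyond[5])  # token 5: within the cutoff in 2 cells, beyond it in 1
```

With real data, `cells` would be the 'input_ids_uncropped' column and `max_len` the truncation length.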
The changes also move the cropping and length calculations into a single function, which avoids iterating over the dataset twice and can save time on large datasets.
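The single-pass idea can be sketched roughly as below. This is not the merged code: the function name `crop_and_measure`, the `max_len` and `keep_uncropped` parameters, and the 'length' key are assumptions made for illustration. Only 'input_ids_uncropped' and 'length_uncropped' are names taken from this PR.

```python
def crop_and_measure(example, max_len=2048, keep_uncropped=True):
    """Crop one cell's tokens and compute its lengths in a single call.

    Hypothetical sketch: `example` is a dict with an 'input_ids' list.
    """
    tokens = example["input_ids"]
    out = {
        "input_ids": tokens[:max_len],           # cropped tokens
        "length": min(len(tokens), max_len),     # length after cropping
    }
    if keep_uncropped:
        out["input_ids_uncropped"] = list(tokens)  # all tokens, untouched
        out["length_uncropped"] = len(tokens)      # gene count before cropping
    return out


# Toy example with max_len=2.
result = crop_and_measure({"input_ids": [5, 7, 9, 3]}, max_len=2)
print(result["input_ids"], result["length"], result["length_uncropped"])
```

Because cropping and both lengths come out of one call per cell, a per-example mapping over the dataset needs only one pass instead of two.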
jamieb-nvs changed pull request status to open
ctheodoris changed pull request status to merged
Thank you for your valuable contribution to the codebase!