ctheodoris/Geneformer · tokenizer-uncropped-input

Adding option to keep uncropped input_ids as a feature (caec2855)

jamieb-nvs

Nov 4, 2023

No description provided.

jamieb-nvs

Nov 4, 2023

Added an option to the create_dataset function in tokenizer.py to keep all gene tokens for a cell in a new feature, 'input_ids_uncropped', and to record the total number of genes in the cell before truncation/cropping in 'length_uncropped'. Both are stored in addition to the existing truncated features.

This allows analysis of uncropped gene token ranks, which can be useful for understanding the coverage of genes across the cells in a dataset, for example in how many cells a gene's rank is <= 2048 compared to > 2048.
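As a rough illustration of that kind of analysis, here is a minimal sketch in plain Python. It assumes each cell's 'input_ids_uncropped' is a list of gene tokens already ordered by rank, with each gene appearing at most once per cell. The function name `rank_coverage` and its `max_len` parameter are hypothetical and not part of Geneformer.

```python
from collections import Counter


def rank_coverage(uncropped_cells, max_len=2048):
    """Count, per gene token, the cells where it falls inside vs. outside the crop.

    uncropped_cells: iterable of rank-ordered token lists (one list per cell).
    max_len: crop length; 2048 is the cutoff mentioned in this PR.
    """
    within = Counter()  # cells in which the gene's rank is <= max_len
    beyond = Counter()  # cells in which the gene's rank is > max_len
    for tokens in uncropped_cells:
        within.update(tokens[:max_len])
        beyond.update(tokens[max_len:])
    return within, beyond


# Toy data: three cells, with a crop length of 2 to keep the example small.
cells = [[5, 7, 9], [7, 5], [9, 7, 5, 3]]
within, beyond = rank_coverage(cells, max_len=2)
```

Comparing `within[token]` with `beyond[token]` for a given gene then shows how often truncation drops it.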

The changes also move the cropping and length calculations into a single function, which avoids iterating over the dataset twice and can save time on large datasets.
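The single-pass idea can be sketched roughly as follows. This is not the code in tokenizer.py: the function name `crop_and_measure` and the `max_len` and `keep_uncropped` parameters are hypothetical, and only the feature names 'input_ids_uncropped' and 'length_uncropped' come from the description above.

```python
def crop_and_measure(example, max_len=2048, keep_uncropped=False):
    """Crop a cell's tokens and compute lengths in one call per example."""
    tokens = example["input_ids"]
    out = {
        "input_ids": tokens[:max_len],           # truncated tokens
        "length": min(len(tokens), max_len),     # length after truncation
    }
    if keep_uncropped:
        out["input_ids_uncropped"] = list(tokens)  # all gene tokens for the cell
        out["length_uncropped"] = len(tokens)      # gene count before truncation
    return out


# Toy data: a crop length of 3 keeps the example small.
examples = [{"input_ids": list(range(5))}, {"input_ids": [1, 2]}]
processed = [crop_and_measure(ex, max_len=3, keep_uncropped=True) for ex in examples]
```

With a Hugging Face `datasets.Dataset`, a function of this shape could be passed to `Dataset.map` so that cropping and both length features are produced in one pass over the data.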

jamieb-nvs changed pull request status to open Nov 4, 2023

ctheodoris changed pull request status to merged Nov 6, 2023

ctheodoris

Owner Nov 6, 2023

Thank you for your valuable contribution to the codebase!
