90M example input data

#428
by mchatz - opened

Hey,
Could you provide some example input file data for the new 90M model (with the new tokenization scheme)?

Thanks in advance

Thank you for your question! When we structure the 95M dataset repository, we will add example input files there. In the meantime, the current default tokenizer settings, listed below, will produce tokenized datasets in the format the 95M model expects. For example, the data from the embryonic stem cell dataset can be tokenized in this way to produce the 95M analog of this 30M tokenized dataset. You can confirm that the output data has a maximum length of 4096 and that the special tokens are present (a CLS token at the beginning of each cell and an EOS token at the end of each cell). If you run into any issues, please let us know.

model_input_size=4096
special_token=True
collapse_gene_ids=True
gene_median_file=GENE_MEDIAN_FILE              # 95M file
token_dictionary_file=TOKEN_DICTIONARY_FILE    # 95M file
gene_mapping_file=ENSEMBL_MAPPING_FILE         # 95M file
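
For reference, a minimal sketch of passing these settings to Geneformer's TranscriptomeTokenizer, following the usage pattern from the Geneformer examples; the file names, directories, attribute mapping, and file format here are placeholders (not from this thread), so substitute your own paths and the dictionary files shipped with the 95M model:

from datasets import load_from_disk
from geneformer import TranscriptomeTokenizer

# Placeholder paths to the 95M tokenizer files; use the files distributed with the 95M model.
GENE_MEDIAN_FILE = "gene_median_dictionary_95M.pkl"
TOKEN_DICTIONARY_FILE = "token_dictionary_95M.pkl"
ENSEMBL_MAPPING_FILE = "ensembl_mapping_95M.pkl"

# Tokenize raw single-cell data with the 95M default settings listed above.
tk = TranscriptomeTokenizer(
    custom_attr_name_dict={"cell_type": "cell_type"},  # optional cell metadata to carry over
    nproc=4,
    model_input_size=4096,
    special_token=True,
    collapse_gene_ids=True,
    gene_median_file=GENE_MEDIAN_FILE,
    token_dictionary_file=TOKEN_DICTIONARY_FILE,
    gene_mapping_file=ENSEMBL_MAPPING_FILE,
)
tk.tokenize_data(
    "raw_data_directory",   # directory containing the input .h5ad (or .loom) files
    "tokenized_output",     # output directory
    "esc_95M",              # output prefix
    file_format="h5ad",
)

# Spot-check the output: max length 4096, and a constant first/last token id per cell,
# which should correspond to the CLS and EOS ids in the 95M token dictionary.
ds = load_from_disk("tokenized_output/esc_95M.dataset")  # output path may differ in your setup
assert max(len(cell) for cell in ds["input_ids"]) <= 4096
assert len({cell[0] for cell in ds["input_ids"]}) == 1   # same CLS id at the start of every cell
assert len({cell[-1] for cell in ds["input_ids"]}) == 1  # same EOS id at the end of every cell
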
ctheodoris changed discussion status to closed
