tokenizing leads to missing genes

#506
by marvany - opened

When tokenizing 100 genes (extracted from the token dictionary) to make synthetic data, only 2 of those genes are found in the generated .dataset (the Arrow file). And when I used an even smaller dataset, 0 genes were found in the Arrow file.

Even more oddly, all of my tokenized cells (which were filled with random values) end up with the exact same entries, just in a different order, either:

       input_ids             length
    1  [2, 20034, 19311, 3]  4

or:

       input_ids             length
    1  [2, 19311, 20034, 3]  4

Could you please provide assistance? Thank you in advance.

Thanks for your question. For a gene to be added to the rank value encoding, the gene must be detected in that cell, so with a smaller sample size the chance that all 100 genes are found in those cells is even smaller. The order is expected to change based on the relative ranking of the genes after the median scaling is applied.

Please also ensure you are using the same dictionary to extract the genes as the one you are using for tokenization. If you have additional questions, please provide the code you are using to complete this process, as it is not a usual process and, of note, is not recommended for modeling with Geneformer, as discussed in the tokenization example. It is recommended that no gene feature selection be applied prior to modeling with Geneformer, since this will put the cells out of distribution relative to how the model is normally trained.
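To illustrate the mechanics, here is a toy sketch of rank value encoding (the gene names and median values are invented for illustration; this is not the library's code):

    # Toy illustration of rank value encoding; genes and medians are invented.
    expression = {"ENSG_A": 5.0, "ENSG_B": 0.0, "ENSG_C": 12.0}   # raw counts in one cell
    gene_median = {"ENSG_A": 2.5, "ENSG_B": 1.0, "ENSG_C": 12.0}  # corpus-wide medians

    # Only detected genes enter the encoding; each is scaled by its median,
    # then genes are ordered by the scaled value, highest first.
    scaled = {g: v / gene_median[g] for g, v in expression.items() if v > 0}
    rank_encoding = sorted(scaled, key=scaled.get, reverse=True)
    print(rank_encoding)  # ['ENSG_A', 'ENSG_C'] -- ENSG_B is absent (not detected)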

Thank you for your answer. I believe my issue had to do with gene mapping. I edited the following section of the code in tokenizer.py (lines ~125-145, see below) and now it works (on synthetic data at least).
Briefly: the "ensembl_ids" variable contained Ensembl IDs ('ENSG...'), while gene_mapping_dict.keys() contained gene symbols (e.g. the 'ABCD' gene);
switching to gene_mapping_dict.values(), with some accompanying modifications, solved the issue.
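A quick way to confirm which way the mapping runs (a sketch of the check I mean, using the dictionary paths listed below):

    import pickle

    # Same files as passed to the tokenizer below.
    with open("geneformer/token_dictionary_gc95M.pkl", "rb") as f:
        token_dictionary = pickle.load(f)
    with open("geneformer/gene_name_id_dict_gc95M.pkl", "rb") as f:
        gene_mapping_dict = pickle.load(f)

    sample_id = next(k for k in token_dictionary if not k.startswith("<"))
    print(sample_id in gene_mapping_dict)                # False if the keys are gene symbols
    print(sample_id in set(gene_mapping_dict.values()))  # True if the values are Ensembl IDs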

Please let me know if I am missing some argument or using faulty input.

Here is the edited code in tokenizer.py:

        # Get the genes that exist in the mapping dictionary and the value of those genes.
        # To restore the original code, uncomment the "ma comment" lines and comment out the "ma line" lines.
        # genes_in_map_dict = [gene for gene in ensembl_ids if gene in gene_mapping_dict.keys()]  # ma comment
        # vals_from_map_dict = [gene_mapping_dict.get(gene) for gene in genes_in_map_dict]        # ma comment
        genes_in_map_dict = [gene for gene in ensembl_ids if gene in gene_mapping_dict.values()]  # ma line

        # If the genes in the mapping dict and the values of those genes are of the same length,
        # simply return the mapped values.
        # if len(set(genes_in_map_dict)) == len(set(vals_from_map_dict)):  # ma comment
        if True:  # ma line: condition above bypassed
            # mapped_vals = [gene_mapping_dict.get(gene.upper()) for gene in data.ra["ensembl_id"]]  # ma comment
            mapped_vals = [gene if gene in genes_in_map_dict else None for gene in data.ra["ensembl_id"]]  # ma line
            data.ra["ensembl_id_collapsed"] = mapped_vals
            return data_directory
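An alternative that avoids editing tokenizer.py (an untested sketch; it assumes gene_mapping_file accepts any pickled dict, and the output filename here is hypothetical) would be to pass an identity mapping built from the token dictionary, so every Ensembl ID maps to itself:

    import pickle

    with open("geneformer/token_dictionary_gc95M.pkl", "rb") as f:
        token_dictionary = pickle.load(f)

    # Map each non-special Ensembl ID to itself so no collapsing occurs.
    identity_mapping = {g: g for g in token_dictionary
                        if not (g.startswith("<") and g.endswith(">"))}

    with open("identity_gene_mapping.pkl", "wb") as f:
        pickle.dump(identity_mapping, f)

    # Then pass gene_mapping_file="identity_gene_mapping.pkl" to TranscriptomeTokenizer.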

Below are all the steps I followed to tokenize the synthetic data.

#####################################

Here are the dictionaries I was using:

geneformer/token_dictionary_gc95M.pkl
geneformer/gene_median_dictionary_gc95M.pkl
geneformer/gene_name_id_dict_gc95M.pkl

#####################################

I sampled genes the following way:

Load the token dictionary (used for mapping between Ensembl IDs and tokens)

    import pickle
    import random

    # token_dictionary_path points to geneformer/token_dictionary_gc95M.pkl (listed above)
    with open(token_dictionary_path, "rb") as f:
        token_dictionary = pickle.load(f)

Extract Ensembl IDs (keys from the dict, except special tokens)

    ensembl_ids = [gene_id for gene_id in token_dictionary.keys()
                   if not gene_id.startswith("<") and not gene_id.endswith(">")]

Sample and store 100 random genes

    sampled_genes = random.sample(ensembl_ids, 100)
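For reproducibility (an optional tweak; the seed value is arbitrary), fixing the seed before sampling keeps the same 100 genes across runs:

    random.seed(0)  # any fixed value works; 0 chosen arbitrarily
    sampled_genes = random.sample(ensembl_ids, 100)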

##########################################################

I then created a matrix with random values to build the loom file:

Define number of cells to generate

    num_cells = 50

Create synthetic gene expression table

    import numpy as np

    expression_matrix = np.zeros((len(sampled_genes), num_cells), dtype=np.float32)

Fill in the synthetic table with random values

    for i in range(len(sampled_genes)):
        for j in range(num_cells):
            expression_matrix[i, j] = random.randint(1, 30)
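A vectorized equivalent (a minor style alternative; note that np.random.randint's upper bound is exclusive, hence 31):

    expression_matrix = np.random.randint(1, 31, size=(len(sampled_genes), num_cells)).astype(np.float32)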

Create row attributes (ensembl_id information)

    row_attrs = {
        "ensembl_id": np.array(sampled_genes)
    }

Create column attributes

    import loompy

    col_attrs = {
        "n_counts": np.sum(expression_matrix, axis=0)
    }

    # loompy.create("./temp_tokenization/loom_files/test_data2.loom", expression_matrix, row_attrs, col_attrs)
    loompy.create("./loom_files/test_data2.loom", expression_matrix, row_attrs, col_attrs)
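Before tokenizing, a quick sanity check on the loom file (the path is the one just created above):

    with loompy.connect("./loom_files/test_data2.loom") as ds:
        print(ds.shape)                 # (100, 50): genes x cells
        print(ds.ra["ensembl_id"][:5])  # first few sampled Ensembl IDs
        print(ds.ca["n_counts"][:5])    # per-cell total counts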

##########################################################

Finally, I tokenized the data:

Initialize tokenizer with custom dictionary paths

    from geneformer import TranscriptomeTokenizer

    tokenizer = TranscriptomeTokenizer(
        custom_attr_name_dict=None,
        nproc=1,
        # All dictionary paths were filled in manually because the defaults raised errors
        # (the errors traced back to a faulty git lfs download, probably an HPC issue).
        token_dictionary_file=token_dictionary_path,
        gene_median_file=gene_median,
        gene_mapping_file=gene_mapping,
    )
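Since the default dictionaries failed with what looked like a faulty git lfs download, a quick check for un-downloaded LFS pointer stubs (a small sketch; a pointer stub is a ~130-byte text file starting with 'version http', while a real pickle is far larger and binary):

    import os

    for path in ["geneformer/token_dictionary_gc95M.pkl",
                 "geneformer/gene_median_dictionary_gc95M.pkl",
                 "geneformer/gene_name_id_dict_gc95M.pkl"]:
        with open(path, "rb") as f:
            head = f.read(12)
        print(path, os.path.getsize(path), head)
        # b'version http' here means the file is an LFS pointer, not the real pickle.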

Tokenize data

    import os

    loom_dir_str = './loom_files/'
    output_dir_str = './temp_tokenization/tokenized3/'
    output_prefix_name = 'tokenized_data2'

    print("loom_dir_str contains " + str(os.listdir(loom_dir_str)))

    tokenizer.tokenize_data(
        data_directory=loom_dir_str,
        output_directory=output_dir_str,
        output_prefix=output_prefix_name,
        file_format="loom",
    )
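To confirm what actually ended up in the output, the resulting .dataset directory can be loaded back (assuming it lands at output_dir_str + output_prefix_name + ".dataset", which matches what I observed):

    from datasets import load_from_disk

    ds = load_from_disk(output_dir_str + output_prefix_name + ".dataset")
    print(ds)                   # number of tokenized cells and columns
    print(ds[0]["input_ids"])   # token IDs for the first cell
    print(ds[0]["length"])      # genes that survived tokenization (plus special tokens)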
