TypeError: get_embs() missing 1 required positional argument: 'token_gene_dict', when running InSilicoPerturber()
Hi, thanks for the great package. When running either the cardiomyopathy example or my own data, I get stuck on the perturb_data() function with the error message below.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/miniconda3/envs/geneformer/lib/python3.12/site-packages/geneformer/in_silico_perturber.py", line 444, in perturb_data
    self.isp_perturb_all(
  File "/miniconda3/envs/geneformer/lib/python3.12/site-packages/geneformer/in_silico_perturber.py", line 734, in isp_perturb_all
    full_original_emb = get_embs(
                        ^^^^^^^^^
TypeError: get_embs() missing 1 required positional argument: 'token_gene_dict'
```
These are the InSilicoPerturber parameters I've used (and the perturb_data() call that raises the error):
```python
isp = InSilicoPerturber(perturb_type="delete",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=3,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        filter_data=filter_data_dict,
                        cell_states_to_model=cell_states_to_model,
                        state_embs_dict=state_embs_dict,
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=400,
                        nproc=16)

# outputs intermediate files from in silico perturbation
isp.perturb_data("/download_from_repo/fine_tuned_models/geneformer_cardiomyopathies/",
                 "/download_from_repo/tutorial_cardomyopathies.dataset",
                 output_dir,
                 output_prefix)
```
I tried setting "token_dictionary_file" to the one distributed with the Geneformer package (token_dictionary.pkl), but that did not solve the issue. Please let me know if you need anything to reproduce the issue, and whether you have a fix!
Thank you again for this wonderful method.
I am having the same issue with my own example. It looks like a bug in the code.
I manually fixed it, but it leads to another error after some computations.
```
RuntimeError:
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.
```
Running the script with nproc=1 helped.
I wonder if gene perturbation was tested...
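For reference, this "bootstrapping phase" RuntimeError is usually raised when worker processes are spawned at module import time. The standard fix is a `__main__` guard; a minimal stdlib sketch (the names here are illustrative, not Geneformer's internals):

```python
import multiprocessing as mp

def square(x):
    return x * x

def run_pool(values, nproc=2):
    # spawn nproc workers and map the function across the inputs
    with mp.Pool(processes=nproc) as pool:
        return pool.map(square, values)

if __name__ == "__main__":
    # Process spawning only happens here, after the interpreter has
    # finished bootstrapping the main module.
    print(run_pool([1, 2, 3]))
```

With the "spawn" start method, child processes re-import the main module; without the guard, that re-import spawns more children and triggers exactly this error.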
This is a very crude fix. In emb_extractor.py, at the beginning of get_embs(), I added the following lines:

```python
def get_embs(
    ...,
    token_gene_dict=None,
    ...
):
    if token_gene_dict is None:
        # fall back to the packaged token dictionary if none was passed
        with open('/.../geneformer/token_dictionary.pkl', 'rb') as file:
            gene_token_dict = pickle.load(file)
        token_gene_dict = {v: k for k, v in gene_token_dict.items()}
```
where `...` is your path to the package files. You need to reinstall the package after you make the changes. Obviously it is not a proper fix, but it allowed me to run the analysis.
Thank you for the discussions! We are working on other changes including fixing the gene token dict issue (we will be changing it in the in silico perturber code to avoid potentially using a gene token dictionary that does not match the model if there are multiple gene token dictionaries available).
Thank you for also pointing out the hundreds-of-genes perturbation scenario. The code was designed to allow flexible usage, so it's helpful to know when people encounter an issue in a usage setting we haven't considered. We can add an option for a summarized prefix in case the list is that long.
For the case of the genes not expressed in cells: because the current model has an input size limit of 2048, genes beyond 2048 in the rank value encoding will be truncated. We would suggest that you check whether the gene you are testing is present in the rank value encodings of the tokenized dataset. Because the data is scaled by the nonzero median expression value, the truncation cannot solely be determined by the relative expression level within the raw counts.
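As a sketch of that check, the idea is simply whether the gene's token falls within the first 2048 positions of the cell's rank value encoding. The names below (`gene_in_encoding`, the toy cell) are illustrative, not the package API:

```python
MODEL_INPUT_SIZE = 2048  # current Geneformer model input limit

def gene_in_encoding(input_ids, gene_token, input_size=MODEL_INPUT_SIZE):
    """True if the gene's token survives truncation to the model input size."""
    return gene_token in input_ids[:input_size]

# toy example: a "cell" whose encoding is a list of tokens ranked by
# median-scaled expression (highest rank first)
toy_cell = list(range(3000))
print(gene_in_encoding(toy_cell, 100))   # within the first 2048 positions
print(gene_in_encoding(toy_cell, 2500))  # beyond position 2048: truncated
```

In practice you would load the tokenized dataset, look up the gene's token in the token dictionary, and run this check over the cells you plan to perturb.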
Also, you can install packages with the -e option with pip so that when you make changes locally they are reflected in the package without reinstallation. (If you are running code in a notebook, you do need to restart the kernel / reimport the package to reflect the local changes though).
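The notebook part of this can be sketched as follows, using a stdlib module as a stand-in for the locally edited package (after `pip install -e /path/to/Geneformer`, you would reload the geneformer module instead):

```python
import importlib
import json  # stand-in for the package module you edited locally

# With an editable install, on-disk edits are reflected without
# reinstalling, but a running kernel still holds the old module object,
# so an explicit reload (or kernel restart) is needed:
json = importlib.reload(json)
print(json.__name__)
```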
Regarding the prior question about the RuntimeError: it would be helpful to know at what step this occurs. There is another open discussion where multiprocessing has an issue with Hugging Face Datasets's .map function. We made some suggestions in the other discussion. If you consistently reproduce this error and these changes help, please let us know so we can incorporate them.
https://huggingface.co/ctheodoris/Geneformer/discussions/369
Thank you so much for your response!
I have already removed part of my comment about perturbing hundreds of genes, as I realized that the genes are not processed one by one but all together. I missed this point in the documentation, and of course perturbing hundreds of genes at once likely won't make much sense.
With regard to the missing genes, the limitation of 2048 input tokens of course makes sense! I think I had this in mind when I started working on the code, but lost the thought along the way :)
By the way, I am hoping to find a way to increase the context size to 4K or 8K to adapt the model to other types of sequencing data. Do you think that could work?
Sounds good. Yes, you can increase the input size. Transformers are quadratic in time complexity with increasing input size so that would be the primary limitation.
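To put rough numbers on that scaling (illustrative back-of-the-envelope only, ignoring constant factors and non-attention costs):

```python
def relative_attention_cost(new_len, base_len=2048):
    """Relative self-attention cost vs. the base input size (quadratic)."""
    return (new_len / base_len) ** 2

print(relative_attention_cost(4096))  # 4.0  -> ~4x the attention cost
print(relative_attention_cost(8192))  # 16.0 -> ~16x the attention cost
```

So doubling the context to 4096 roughly quadruples attention compute and memory, and 8192 is roughly 16x, which is why input size is the primary practical limitation.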
Thanks for your reply! Could you by chance give any advice on the best way to approach this?
The error from this discussion should be resolved now but let us know if it’s still causing issues.
Regarding larger input size, the updated model here has a larger input size of 4096. If you would like to pretrain a new model with a larger input size you can do that by changing the input size value in the pretraining code.
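As a hedged sketch of what that change looks like: Geneformer uses a BERT-style configuration, where the input size corresponds to the maximum position embeddings. The hyperparameter values below are placeholders, not Geneformer's actual pretraining settings:

```python
# Illustrative pretraining-config fragment; only the input size is the
# point here, the other values are placeholders.
model_config = {
    "max_position_embeddings": 4096,  # raised from 2048 for longer inputs
    "hidden_size": 256,
    "num_hidden_layers": 6,
    "num_attention_heads": 4,
}

# Tokenized examples must then be truncated/padded to the new limit:
def clip_to_input_size(input_ids, config=model_config):
    return input_ids[: config["max_position_embeddings"]]

print(len(clip_to_input_size(list(range(10000)))))  # 4096
```

The tokenizer's truncation length and any length-dependent data collation must be updated to match, and position embeddings learned at 2048 do not transfer directly to the new positions, which is why pretraining from scratch is suggested.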