in_silico_perturbation: too many indices for tensor of dimension 2
Thank you for your help with the tool.
However, with the updated version I encountered another error, "too many indices for tensor of dimension 2" (traceback below). Could it be caused by the empty alt_states list [], which is meant to cover all the other 27 classes, in cell_states_to_model={'state_key': 'cell_type', 'start_state': 'A', 'goal_state': 'B', 'alt_states': []}?
.
.
.
.
.
Map (num_proc=20): 100%|██████████| 1682/1682 [00:00<00:00, 3219.35 examples/s]
Map (num_proc=20): 100%|██████████| 1681/1681 [00:01<00:00, 853.88 examples/s]
Map (num_proc=20): 100%|██████████| 1681/1681 [00:00<00:00, 3065.67 examples/s]
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[3], line 18
2 from geneformer import InSilicoPerturberStats
3 isp = InSilicoPerturber(perturb_type="overexpress",
4 perturb_rank_shift=None,
5 genes_to_perturb="all",
(...)
16 forward_batch_size=16,
17 nproc=20)
---> 18 isp.perturb_data("/mnt/c/Users/pc/Downloads/GF/230803_geneformer_CellClassifier_bone_marrow_L2048_B8_LR0.00028292192255361916_LSlinear_WU559.5193527108581_E1_Oadamw_F2",
19 "/GD/Sequencing_data/benchmark/transformer.dataset",
20 "/mnt/c/Users/pc/Downloads/GF",
21 "M7")
File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:952, in InSilicoPerturber.perturb_data(self, model_directory, input_data_file, output_directory, output_prefix)
948 return example[state_name] in [start_state]
950 filtered_input_data = filtered_input_data.filter(filter_for_origin, num_proc=self.nproc)
--> 952 self.in_silico_perturb(model,
953 filtered_input_data,
954 layer_to_quant,
955 state_embs_dict,
956 output_directory,
957 output_prefix)
File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:1096, in InSilicoPerturber.in_silico_perturb(self, model, filtered_input_data, layer_to_quant, state_embs_dict, output_directory, output_prefix)
1089 for combo_lvl in range(self.combos+1):
1090 perturbation_batch, indices_to_perturb = make_perturbation_batch(example_cell,
1091 self.perturb_type,
1092 self.tokens_to_perturb,
1093 self.anchor_token,
1094 combo_lvl,
1095 self.nproc)
-> 1096 cos_sims_data = quant_cos_sims(model,
1097 self.perturb_type,
1098 perturbation_batch,
1099 self.forward_batch_size,
1100 layer_to_quant,
1101 original_emb,
1102 self.tokens_to_perturb,
1103 indices_to_perturb,
1104 self.perturb_group,
1105 self.cell_states_to_model,
1106 state_embs_dict,
1107 self.pad_token_id,
1108 model_input_size,
1109 self.nproc)
1111 if self.cell_states_to_model is None:
1112 # update cos sims dict
1113 # key is tuple of (perturbed_gene, affected_gene)
1114 # or (perturbed_gene, "cell_emb") for avg cell emb change
1115 cos_sims_data = cos_sims_data.to("cuda")
File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/in_silico_perturber.py:389, in quant_cos_sims(model, perturb_type, perturbation_batch, forward_batch_size, layer_to_quant, original_emb, tokens_to_perturb, indices_to_perturb, perturb_group, cell_states_to_model, state_embs_dict, pad_token_id, model_input_size, nproc)
387 if perturb_group == True:
388 overexpressed_to_remove = len(tokens_to_perturb)
--> 389 minibatch_emb = minibatch_emb[:,overexpressed_to_remove:,:]
391 # if quantifying single perturbation in multiple different cells, pad original batch and extract embs
392 if perturb_group == True:
393 # pad minibatch of original batch to extract embeddings
394 # truncate to the (model input size - # tokens to overexpress) to ensure comparability
395 # since max input size of perturb batch will be reduced by # tokens to overexpress
IndexError: too many indices for tensor of dimension 2
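For readers hitting the same message: the failing line slices minibatch_emb with three indices, so the error means the tensor arrived with only two dimensions. Below is a minimal sketch that reproduces the message outside Geneformer. The tensor sizes are arbitrary illustration values, and the final step (a squeeze collapsing a single-example batch) is only one possible way a 3-D embedding could become 2-D; this thread does not confirm that in_silico_perturber.py does this.

```python
import torch

# Stand-in for a hidden-state tensor: (batch, sequence length, embedding size).
# These sizes are arbitrary illustration values, not Geneformer's.
emb_3d = torch.zeros(4, 10, 8)
sliced = emb_3d[:, 1:, :]  # three indices on a 3-D tensor: fine
print(sliced.shape)

# The same slice on a 2-D tensor raises the IndexError from the traceback.
emb_2d = torch.zeros(10, 8)
message = None
try:
    emb_2d[:, 1:, :]
except IndexError as err:
    message = str(err)
    print(message)

# Assumption for illustration: torch.squeeze() with no dim argument drops
# every size-1 dimension, so a batch of one example loses its batch axis.
single = torch.zeros(1, 10, 8)
squeezed = torch.squeeze(single)
print(squeezed.shape)
```

If a squeeze like this were involved, squeezing only the intended axis, or skipping the squeeze when the batch axis has size 1, would keep the tensor 3-D.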
Thank you for your question. Please update your comment with the arguments you are using to set up the in silico perturber so we can reproduce the error and help troubleshoot.
Thank you for the reminder. Please see below:
import os
os.environ['TRANSFORMERS_CACHE'] = '/mnt/c/Users/pc/Downloads/cache/'

from geneformer import InSilicoPerturber
from geneformer import InSilicoPerturberStats

isp = InSilicoPerturber(perturb_type="overexpress",
                        perturb_rank_shift=None,
                        genes_to_perturb="all",
                        combos=0,
                        anchor_gene=None,
                        model_type="CellClassifier",
                        num_classes=29,
                        emb_mode="cell",
                        cell_emb_style="mean_pool",
                        cell_states_to_model={'state_key': 'cell_type', 'start_state': 'A', 'goal_state': 'B', 'alt_states': []},
                        max_ncells=2000,
                        emb_layer=0,
                        forward_batch_size=16,
                        nproc=20)
isp.perturb_data("/mnt/c/Users/pc/Downloads/GF/230803_geneformer_CellClassifier_bone_marrow_L2048_B8_LR0.00028292192255361916_LSlinear_WU559.5193527108581_E1_Oadamw_F2",
                 "/GD/Sequencing_data/benchmark/transformer.dataset",
                 "/mnt/c/Users/pc/Downloads/GF",
                 "M7")
Thank you for including the setup you are using. I am not encountering this error when running with your arguments. I assume that your fine-tuned cell classification model was trained to distinguish 29 classes, that you are using the current updated version, and that the number of genes detected in each of your cells is >2. If all of that is true, please email me a link to your dataset and fine-tuned model so I can reproduce the error and help you troubleshoot.
Thank you for being so willing to help. I have emailed the fine-tuned model and the corresponding dataset to you.
Thank you again for being so willing to help. After switching to CPU mode, I got it to work by setting forward_batch_size=400. Thus, the issue might be related to the low batch size used with 4 GB of VRAM.
Thank you for this information!
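A note on why the batch size might matter. The sketch below is plain arithmetic, not Geneformer code. It assumes that the 1681 examples shown in the Map progress bars are the perturbed examples forwarded for one cell, and that they are split into consecutive minibatches of forward_batch_size; neither assumption is confirmed in this thread. The helper minibatch_sizes is hypothetical and exists only for this illustration.

```python
def minibatch_sizes(n_examples, forward_batch_size):
    """Sizes of consecutive minibatches when n_examples are split in order."""
    full, remainder = divmod(n_examples, forward_batch_size)
    sizes = [forward_batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes

# Count taken from the Map progress bars in the log (assumed to be the
# number of examples forwarded through the model for one cell).
n_examples = 1681

last_with_16 = minibatch_sizes(n_examples, 16)[-1]
last_with_400 = minibatch_sizes(n_examples, 400)[-1]

print(last_with_16)   # 1
print(last_with_400)  # 81
```

Under these assumptions, forward_batch_size=16 leaves a final minibatch with a single example, while forward_batch_size=400 does not. A single-example minibatch is the case where an embedding could lose its batch dimension, which would be consistent with the 2-dimensional tensor in the error and with the fix reported above.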