Question about interpreting perturber's output

#92
by DYXDAVE - opened

Hi. Thank you for such a good model. I have fine-tuned my cell classifier and classified my sequencing results, and the performance looks really good on t-SNE.
Now I'm trying to use the perturber to analyze different genes, but I find its results hard to interpret. Is there a file that explains what the output columns mean (for example, impact component, N_detection, Test_avg_shift)?
Thank you so much!

Also, I noticed that a Gaussian mixture model is fit to the perturber results and that the impact component is determined by the two means of the mixture. Does this mean we should always perturb all genes, since perturbing only a subset might leave the algorithm unable to separate impactful genes from non-impactful ones?
Also, I found that I cannot use the anchor gene function.
My code was:
from geneformer import InSilicoPerturber

isp = InSilicoPerturber(
    perturb_type="delete",
    perturb_rank_shift=None,
    genes_to_perturb=[
        "ENSG00000120217", "ENSG00000188389", "ENSG00000163599",
        "ENSG00000023445", "ENSG00000026950", "ENSG00000100368",
        "ENSG00000179144", "ENSG00000100453", "ENSG00000180353",
        "ENSG00000043462", "ENSG00000148488", "ENSG00000196932",
        "ENSG00000130348", "ENSG00000197168", "ENSG00000049759",
        "ENSG00000241343", "ENSG00000135862",
    ],
    combos=1,
    anchor_gene="ENSG00000120217",
    model_type="Pretrained",
    num_classes=0,
    emb_mode="cell_and_gene",
    cell_emb_style="mean_pool",
    filter_data={"celltype_minor": ["T cells CD8+"], "subtype": ["TNBC"]},
    cell_states_to_model=None,
    max_ncells=2000,
    emb_layer=-1,
    forward_batch_size=14,
    nproc=8,
    save_raw_data=False,
)
But I get an error saying that my anchor gene input is not a string.

Thank you for your interest in Geneformer!

Thank you for your suggestion. I have added explanations for each of the possible columns in the output of the in silico perturber stats module, which can be accessed with "help(InSilicoPerturberStats)".

Regarding the anchor gene issue, this was due to the validation of input parameters and I have updated the code to resolve this.

For the mixture model, this strategy is used when the question is which gene perturbations have a larger effect in a given cell population compared to other gene perturbations. The approach therefore needs to compare many different gene perturbations in order to identify the ones with a predicted larger impact, and it is intended for unbiased discovery. However, even if the user is interested in just a few specific genes, the output .csv also includes the mean shift in response to each gene perturbation, so groups of genes can still be compared statistically by the user, depending on the scientific question.
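
To make the mixture-model idea concrete, here is a minimal pure-Python sketch, not Geneformer's actual implementation. It fits a two-component 1-D Gaussian mixture with plain EM to per-gene mean shifts and labels the genes that fall into one component as impactful. The gene names and shift values are made up, and treating the component with the lower mean as the "impact" component is an assumption for illustration only.

```python
import math

def fit_two_gaussians(values, n_iter=100):
    """Fit a 1-D, two-component Gaussian mixture with plain EM."""
    lo, hi = min(values), max(values)
    means = [lo, hi]                               # start at the two extremes
    start_var = max((hi - lo) ** 2 / 16, 1e-12)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value
        resp = []
        for x in values:
            dens = [
                w * math.exp(-((x - m) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)
                for w, m, v in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300            # guard against underflow
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weights[k] = nk / len(values)
            means[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, values)) / nk
            variances[k] = max(var, 1e-12)
    return means, resp

# Hypothetical mean shifts per perturbed gene (made-up names and numbers).
mean_shift = {
    "GENE_A": -0.0010, "GENE_B": -0.0015, "GENE_C": -0.0020,
    "GENE_D": -0.0012, "GENE_E": -0.0018,
    "GENE_F": -0.0210, "GENE_G": -0.0250, "GENE_H": -0.0230,
}
genes = list(mean_shift)
means, resp = fit_two_gaussians([mean_shift[g] for g in genes])

# Assumption: the component with the lower mean is the "impact" component.
impact_k = means.index(min(means))
impact_genes = [g for g, r in zip(genes, resp) if r[impact_k] > 0.5]
print(impact_genes)
```

With only a handful of genes, or with genes that all behave similarly, the two fitted components may not correspond to "impact" and "no impact" at all, which is why this strategy benefits from many perturbations.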

On the other hand, if the question is whether a given gene perturbation has a larger effect in one cell population than the same perturbation has in other cell populations, the strategy to use for the in silico perturber stats is "vs_null", where perturbation results in two different cell populations are presented to the model for comparison.
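
As a rough illustration of a test-versus-null comparison, the sketch below compares made-up shifts for one gene perturbation in two cell populations. It reports the average shift in each and a two-sided rank-sum p-value using the normal approximation. The choice of a rank-sum test, the variable names, and all numbers are assumptions for this example; check "help(InSilicoPerturberStats)" for what the library actually computes.

```python
import math

def rank_sum_pvalue(test, null):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    pooled = sorted((v, i) for i, v in enumerate(test + null))
    ranks = [0.0] * len(pooled)
    pos = 0
    while pos < len(pooled):
        end = pos
        # extend over a run of tied values so they share an average rank
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[pos][0]:
            end += 1
        avg_rank = (pos + end) / 2 + 1             # ranks are 1-based
        for j in range(pos, end + 1):
            ranks[pooled[j][1]] = avg_rank
        pos = end + 1
    n1, n2 = len(test), len(null)
    rank_sum = sum(ranks[:n1])                     # rank sum of the test sample
    expected = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - expected) / sd
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical per-cell shifts for the same gene perturbation (made-up numbers).
test_shifts = [-0.020, -0.022, -0.019, -0.025, -0.021, -0.023]   # population of interest
null_shifts = [-0.002, -0.001, -0.003, -0.002, -0.004, -0.001]   # comparison population

test_avg_shift = sum(test_shifts) / len(test_shifts)
null_avg_shift = sum(null_shifts) / len(null_shifts)
pval = rank_sum_pvalue(test_shifts, null_shifts)
print(test_avg_shift, null_avg_shift, test_avg_shift - null_avg_shift, pval)
```

When many genes are tested this way, the p-values would normally also need a multiple-testing correction.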

Finally, if the question is which gene perturbations shift the cells towards a specific goal cell state, the strategy to use is "goal_state_shift".
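
The intuition behind a goal-state comparison can be sketched with toy vectors: a perturbation moves cells towards the goal state if the perturbed embedding is more similar to the goal-state embedding than the original embedding was. The 2-D vectors and the helper names below are hypothetical, and this is only an illustration of the idea, not the library's computation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def shift_toward_goal(original, perturbed, goal):
    """Positive if the perturbation moves the embedding closer to the goal state."""
    return cosine(perturbed, goal) - cosine(original, goal)

# Toy 2-D embeddings (made up; real embeddings come from the model).
goal_state = [1.0, 0.0]
original = [0.0, 1.0]
perturbed_by_gene_x = [1.0, 1.0]     # hypothetical gene X: moves towards the goal
perturbed_by_gene_y = [-1.0, 1.0]    # hypothetical gene Y: moves away from the goal

shift_x = shift_toward_goal(original, perturbed_by_gene_x, goal_state)
shift_y = shift_toward_goal(original, perturbed_by_gene_y, goal_state)
print(shift_x, shift_y)
```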

ctheodoris changed discussion status to closed
