Effect of gene knock-out during in-silico perturbations on gene embeddings
Hi,
We are investigating the effect of knocking out a gene in the in-silico perturbation task and noticed some interesting results. Ultimately, we want to know which other genes change in their embeddings when a gene is perturbed, by comparing each gene's embedding pre and post perturbation. Our hope is to show that the genes whose embeddings change the most are biologically linked to each other.
It seems, however, that the gene embeddings that change the most are determined by the rank of the perturbed gene. For example, for the first cell in the Genecorpus-30M dataset, the gene at rank 1,000 (token 519) is knocked out, i.e. removed from the input sequence. By computing the cosine similarity between each gene's embedding pre and post perturbation, we can see which gene embeddings are changing.
In the plots, the x-axis is the original rank of each gene and the y-axis is the cosine similarity of its embeddings pre and post perturbation. The first plot shows all genes (except the perturbed one, which has no embedding post perturbation), and the second plot is a zoomed-in view (cosine similarity > 0.975).
There is a noticeable step-down in cosine similarity for genes ranked 1001-2048 compared to genes ranked 0-999, with a few exceptions. As far as I can tell, this is due to the way a knock-out is performed on the input sequence: when we knock out a gene, we are actually performing multiple perturbations, namely the removal of the intended gene plus an increase in rank for every other gene between the intended perturbation and the end of the sequence.
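To make the "multiple perturbations" point concrete, here is a toy illustration (plain Python, not Geneformer code) of what happens to the ranks when one gene is deleted:

```python
# Deleting one gene from a rank-ordered sequence changes the position (rank) of
# every gene that followed it.
ranked_genes = [f"gene_{r}" for r in range(2048)]  # ranks 0..2047
knockout_rank = 1000

perturbed = ranked_genes[:knockout_rank] + ranked_genes[knockout_rank + 1:]

# Genes ranked 0-999 keep their positions; genes originally ranked 1001-2047
# each move up by one, which is itself a small perturbation of their input.
assert perturbed.index("gene_999") == 999
assert perturbed.index("gene_1001") == 1000
assert perturbed.index("gene_2047") == 2046
```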
Questions:
- Is this expected? This behaviour can be observed across multiple cells.
- Do you think this matters in terms of cell shifts? The cell shifts may be influenced by the rank of the perturbed gene rather than by an underlying biological understanding in the model.
- Have you tried using the MASK token for knock-outs? Rather than deleting a gene from the ranked list of genes, we could replace it with a MASK token instead. This would avoid the small single-rank perturbations of the remaining genes and keep the input sequence the same length (a rough sketch of what we mean is below).
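A rough sketch of the MASK-based knock-out we have in mind (an assumption on our side, not the current Geneformer implementation; the mask token id would need to come from the Geneformer token dictionary, and the helper name is only illustrative):

```python
def knockout_with_mask(token_ids, knockout_pos, mask_token_id):
    """Return a same-length sequence with the knocked-out gene replaced by MASK,
    so every other gene keeps its original rank/position."""
    masked = list(token_ids)
    masked[knockout_pos] = mask_token_id
    return masked
```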
I can share a notebook for reproducible results.
Thanks
Thank you for your question and for providing the graphs of your analysis! That is an interesting observation. It would be great to determine whether this occurs at all possible ranks in the sequence or in particular regions more than others. Additionally, it would be helpful to determine whether this occurs with all genes or whether deleting particular genes results in this pattern. We have not tested using the MASK token for knock-outs, but it would be great to test this, or alternatively to use an attention mask, to see whether it affects the pattern you are noticing. If you implement and test either of these options, it would be great if you could provide feedback on your results and/or contribute to the repository.
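For the attention-mask alternative, a possible sketch (again an assumption, not an existing Geneformer feature; the helper name is illustrative) would be to keep the sequence length fixed and hide the knocked-out position from attention:

```python
import torch

def knockout_attention_mask(token_ids, knockout_pos):
    """Keep the full-length sequence but zero out the knocked-out position in the
    attention mask so the other tokens cannot attend to it."""
    input_ids = torch.tensor([token_ids])
    attention_mask = torch.ones_like(input_ids)
    attention_mask[0, knockout_pos] = 0
    return input_ids, attention_mask

# These tensors could then be passed to the model, e.g.
# out = model(input_ids=input_ids, attention_mask=attention_mask)
```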