Question about rank value encoding
Thank you for your great work. It helps me a lot with my analysis.
A small question occurred to me while thinking about the input size of 2048. In the paper, you mention that the input for a cell is the rank of gene expression normalized across the 30M-cell corpus, which means that strongly downregulated genes might get a rank larger than 2048 and be excluded from the input. I'm wondering whether this exclusion will affect our analysis, since these downregulated genes might be important when analyzing gene networks. Do you think expanding the input size would further improve the model's understanding of gene networks?
Thanks
Thank you for your question! The normalization factor, which reflects the nonzero median value of each gene's expression across the ~30M single cells in Genecorpus-30M, allows the model to take advantage of the vast number of observations of each gene's expression across the pretraining corpus. It deprioritizes genes like housekeeping genes that are ubiquitously highly expressed, while prioritizing genes like transcription factors that may be lowly expressed when they are expressed but strongly distinguish cell state. Therefore, if a particular single cell had more than 2048 genes detected, the genes that extend past the 2048 input size will not simply be genes that generally have a lower expression value, like transcription factors, but more so genes that are specifically downregulated in that cell population compared to the others in Genecorpus-30M (as you suggested). The absence of those genes is still highly informative to the model. The model learns from the absence of genes from the rank value encoding similarly to how NLP models learn that the absence of negative words from a movie review indicates a more positive sentiment, for example. A larger input size would still be beneficial, but there is a trade-off in the compute required, given that the time cost of fully dense attention is quadratic in the input size. 2048 is a fairly large input size for fully dense attention, and we selected it because it fully encompassed 93% of the cells in Genecorpus-30M, so it balanced input size against the required compute well. We are working on future methodologies that extend the input size while still optimizing the compute required, to address this further. Please also see the related discussion here: https://huggingface.co/ctheodoris/Geneformer/discussions/134
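To make the truncation behavior described above concrete, here is a minimal sketch of a rank value encoding. It is only an illustration, not the actual Geneformer tokenizer: the function name `rank_value_encode`, the dict-based input format, the simple depth scaling, and all the numbers are assumptions made up for this example.

```python
def rank_value_encode(counts, norm_factors, max_len=2048):
    """Return gene names ordered by normalized expression, truncated to max_len.

    counts: dict of gene -> raw count in one cell (hypothetical input format)
    norm_factors: dict of gene -> nonzero median expression across the corpus
                  (hypothetical values, standing in for the corpus-wide factors)
    """
    total = sum(counts.values())
    scored = []
    for gene, count in counts.items():
        if count <= 0 or gene not in norm_factors:
            continue  # undetected genes never enter the encoding
        # Scale by the cell's total counts, then by the corpus-wide factor.
        scored.append((count / total / norm_factors[gene], gene))
    # Highest normalized value first; ties broken by gene name for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Genes ranked past max_len are dropped, i.e. they are absent from the input.
    return [gene for _, gene in scored[:max_len]]


# Toy cell: two highly expressed housekeeping-like genes and one lowly
# expressed transcription-factor-like gene (all values are made up).
cell = {"ACTB": 500, "GATA1": 6, "GAPDH": 300, "MYC": 0}
factors = {"ACTB": 400.0, "GATA1": 2.0, "GAPDH": 350.0, "MYC": 5.0}
encoding = rank_value_encode(cell, factors, max_len=2)
print(encoding)  # ['GATA1', 'ACTB']
```

In this toy cell, GATA1 has the lowest raw count but ranks first after normalization, and with `max_len=2` it is GAPDH, whose normalized value is lowest relative to the corpus, that falls outside the input.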
Thank you for your answer. It definitely answered my question.
(I was thinking of using the rank of the absolute value of each gene's fold change compared to the normalization factor, since it prioritizes the genes that changed the most, but this encoding would lose the distinction between up-regulation and down-regulation, which makes it a bad choice.)
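A tiny sketch of why the absolute fold change loses direction, assuming a log2 fold change relative to the normalization factor (the helper name `signed_log_fc` and the values are made up for illustration):

```python
import math


def signed_log_fc(value, norm_factor):
    """log2 fold change of a gene's value relative to its normalization factor."""
    return math.log2(value / norm_factor)


up = signed_log_fc(8.0, 2.0)    # 4x above the factor
down = signed_log_fc(0.5, 2.0)  # 4x below the factor

# The signed values differ, but their absolute values are identical, so a
# ranking by absolute fold change cannot tell these two genes apart.
print(up, down, abs(up) == abs(down))
```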
It is reasonable that absent genes are informative to the model.