How Were Input Data and Confusion Matrices Generated for Gene Classification Models in Extended Data Figure 7 a, d, e?
In panels a, d, and e of Extended Data Figure 7 (https://www.nature.com/articles/s41586-023-06139-9/figures/13), Geneformer is compared against SVM, Logistic Regression (LR), and Random Forest (RF) on gene classification tasks. The models are evaluated using two types of input data: -r (ranks) and -c (counts). I would like to understand how the input data for these models were generated. My guess is that the process was as follows:
Input Data Generation:
Fix the number of cells (n) to create the input for each gene.
For counts (-c), a vector of size n is generated where each entry corresponds to the count of that gene across the n cells.
For ranks (-r), the input is a vector of size n where:
If the gene is present in a cell (count > 0), its original rank (based on expression level) is used.
If the gene is absent in a cell, a high rank value (e.g., a maximum rank) is assigned to indicate its absence.
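To make the guess above concrete, here is a minimal sketch of how I imagine the two per-gene feature vectors could be built. Everything in it is my assumption rather than the paper's confirmed method: the function name `build_gene_features`, the choice of `n_genes + 1` as the "absent" rank, and ranking directly on raw counts within each cell (any normalization is left out).

```python
import numpy as np

def build_gene_features(counts, absent_rank=None):
    """Build hypothetical per-gene inputs from a (n_cells, n_genes) count matrix.

    Returns (count_features, rank_features), each of shape (n_genes, n_cells),
    so that row g is the length-n feature vector for gene g.
    """
    counts = np.asarray(counts)
    n_cells, n_genes = counts.shape
    if absent_rank is None:
        # Assumed sentinel: one worse than the worst possible rank.
        absent_rank = n_genes + 1

    # Rank genes within each cell; rank 1 = most highly expressed.
    # A stable sort breaks ties by gene index (an arbitrary choice).
    order = np.argsort(-counts, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(n_cells)[:, None]
    ranks[rows, order] = np.arange(1, n_genes + 1)

    # Genes with zero count in a cell get the sentinel rank.
    ranks = np.where(counts > 0, ranks, absent_rank)

    # Transpose so each gene becomes one sample with n_cells features.
    return counts.T.copy(), ranks.T.copy()

# Toy example: 2 cells x 3 genes.
toy_counts = np.array([[5, 0, 2],
                       [0, 3, 3]])
count_features, rank_features = build_gene_features(toy_counts)
```

In this sketch each gene is one training sample and the n cells are its features, which is what would let SVM, LR, or RF produce one prediction per gene.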
Testing:
Geneformer provides a prediction for each gene in every sample, so each gene receives multiple predictions, and all of them are used to build the confusion matrix on the test dataset. In contrast, SVM, LR, and RF generate a single prediction per gene, and their confusion matrices are built from these single predictions.
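The difference I have in mind can be sketched with toy numbers. The labels and predictions below are made up for illustration, and `confusion` is a small helper I wrote for this example, not a function from the Geneformer code. Rows are true classes and columns are predicted classes.

```python
import numpy as np

def confusion(y_true, y_pred, n_classes=2):
    """Count (true, predicted) pairs into an n_classes x n_classes matrix."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

# Hypothetical binary labels for three test genes.
gene_labels = np.array([0, 1, 1])

# Case A (my guess for SVM/LR/RF): exactly one prediction per gene.
per_gene_pred = np.array([0, 1, 0])
cm_single = confusion(gene_labels, per_gene_pred)

# Case B (my guess for Geneformer): one prediction per occurrence of a gene
# in a cell, so a gene seen in several cells is counted several times.
occ_gene = np.array([0, 0, 1, 1, 1, 2])   # which gene each occurrence is
occ_pred = np.array([0, 1, 1, 1, 0, 1])   # prediction for that occurrence
cm_multi = confusion(gene_labels[occ_gene], occ_pred)
```

Under this reading, the entries of `cm_single` sum to the number of test genes, while the entries of `cm_multi` sum to the number of gene occurrences across cells, so the two matrices would not be on the same scale.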
Thank you for your question! This repository is for asking questions, raising issues, or providing feedback on the Geneformer model/code itself. For any questions about our papers (for example, evaluation with alternative approaches, or methods for measuring contractility in cells to validate the predicted therapeutic targets), please email me. Thank you!