Question about classifying cancer cells
Thank you for your great work on Geneformer.
I'm currently working with a breast cancer scRNA-seq dataset, and I'm trying to classify the cells in it and compare the results with clustering. I'm a novice in transfer learning, and I noticed that you excluded malignant cells from your pretraining set. Does this mean it might not be a good idea to classify breast cancer cells using the fine-tuned pre-trained model?
Or, if it is possible to classify breast cancer cells, what strategy should I use? Should I collect public scRNA-seq datasets of labeled breast cancer cells, run the tokenizer, and fine-tune the pre-trained model?
Thanks
David
Thank you for your interest in Geneformer. As stated in the manuscript, we excluded cells with high mutational burdens (e.g. malignant cells and immortalized cell lines) that may lead to substantial network rewiring without companion genome sequencing to facilitate interpretation. While this led to fewer cells being included overall, we wanted to prioritize training a high-quality embedding space, knowing that the model could always be fine-tuned towards cancer applications after pretraining established its fundamental understanding of network dynamics. The approach you outlined is exactly how we would suggest proceeding. You can follow the disease classifier example to tune the learning hyperparameters for your application. We would suggest fine-tuning on multiple labeled breast scRNA-seq datasets, if available, so that the model observes a broad range of examples, does not overfit to a particular dataset, and generalizes to your new dataset.
As with any machine learning application, we suggest you separate your data into train, validation, and test sets, where the train and validation sets contain different individuals as opposed to cells subsampled from the same individual, and the test set is a totally separate held-out dataset (if available) to confirm your model performance after fine-tuning.
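To illustrate the individual-level split described above, here is a minimal sketch in plain Python. It is not part of Geneformer; the function name `split_by_individual`, the `donor_ids` list, and the patient labels are hypothetical, and it assumes you have one donor identifier per cell in the same order as your tokenized dataset.

```python
import random
from collections import defaultdict


def split_by_individual(donor_ids, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) so that no individual has cells in both sets.

    donor_ids: one donor/patient identifier per cell (hypothetical input).
    """
    # Group cell indices by the individual they came from.
    cells_by_donor = defaultdict(list)
    for idx, donor in enumerate(donor_ids):
        cells_by_donor[donor].append(idx)

    donors = sorted(cells_by_donor)
    if len(donors) < 2:
        raise ValueError("need at least two individuals to split by individual")

    # Shuffle individuals (not cells) reproducibly.
    random.Random(seed).shuffle(donors)

    # Hold out at least one individual, but never all of them.
    n_val = max(1, round(len(donors) * val_fraction))
    n_val = min(n_val, len(donors) - 1)
    val_donors = set(donors[:n_val])

    train_idx, val_idx = [], []
    for donor, idxs in cells_by_donor.items():
        (val_idx if donor in val_donors else train_idx).extend(idxs)
    return sorted(train_idx), sorted(val_idx)


# Toy example: 10 cells from 5 made-up patients.
donor_ids = ["P1", "P1", "P2", "P2", "P2", "P3", "P4", "P4", "P5", "P5"]
train_idx, val_idx = split_by_individual(donor_ids, val_fraction=0.2, seed=0)
```

The resulting index lists can then be used to select rows from your tokenized dataset. The held-out test set should still come from a separate dataset where possible, as noted above.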
I used the perturbation analysis with this model, and the results are pretty good when I cross-validate them against online resources and papers, which suggests the pre-trained model is good. But when I tried to fine-tune my cell classification model, the model tended to classify all cells into one category after fine-tuning. It makes me wonder whether this is caused by the overwhelming number of cancerous epithelial cells. Is this a case where we need to use the disease classification model? Or could I find another dataset to fine-tune my model (e.g. one specific to a single subtype of breast cancer, or a broad one covering healthy cells and subtypes of cancerous cells)? Thanks!
Thank you for your question. Disease classification is the same process as cell state classification, since disease is a category of cell state. There are many factors that affect fine-tuning, including the dataset you used, whether the classes were balanced, and whether you fully optimized the learning hyperparameters. It sounds like you may have a significant class imbalance, so you may want to balance the classes better in your training dataset.
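One simple way to balance classes is to downsample every class to the size of the rarest one. The sketch below shows this in plain Python; it is only an illustration and not a Geneformer function. The name `downsample_to_balance` and the cell type labels and counts are made up for the example.

```python
import random
from collections import Counter, defaultdict


def downsample_to_balance(labels, seed=0):
    """Return sorted cell indices that keep an equal number of cells per class."""
    # Group cell indices by class label.
    idx_by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        idx_by_class[label].append(idx)

    # Every class is reduced to the size of the rarest class.
    n_keep = min(len(idxs) for idxs in idx_by_class.values())

    rng = random.Random(seed)
    kept = []
    for label in sorted(idx_by_class):
        # Sample without replacement within each class.
        kept.extend(rng.sample(idx_by_class[label], n_keep))
    return sorted(kept)


# Toy example with a heavy imbalance (labels and counts are made up).
labels = ["cancer_epithelial"] * 90 + ["T_cell"] * 7 + ["fibroblast"] * 3
balanced_idx = downsample_to_balance(labels)
balanced_counts = Counter(labels[i] for i in balanced_idx)
```

Downsampling like this would normally be applied to the training set only, so that validation and test performance still reflect the real class distribution. If the rarest class is very small, this discards most of the data, in which case collecting more labeled cells for the rare classes may be the better option.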