Question about test dataset
Hi, thank you for your model.
I'm currently looking at the code for cell classification, and I'm not sure whether the test dataset was used, since the code provided only does an 8:2 cross-validation split. The variable "eval_dataset=load_from_disk("/path/to/cell_type_test_data.dataset")" does not seem to be used in the cell classification notebook; in the trainer part it is replaced by organ_evalset (which is a subset of the training dataset). So should we use cell_type_test_data.dataset to test the fine-tuned model?
Thank you for pointing out the unused variable. I updated the notebook to remove it to avoid confusion. The held-out 20% of the data was used to evaluate the fine-tuned model, for both Geneformer and the alternative models. Because we did not optimize hyperparameters for Geneformer or the alternative models for this application, we did not require a third held-out dataset. If you optimize hyperparameters or in any way change the training process in response to evaluation results from a validation dataset, you must then evaluate the model performance on a third held-out test dataset. Of note, we strongly recommend optimizing hyperparameters for any new downstream application or dataset. We didn't do so in this case only to maintain an equivalent comparison between the models.
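For readers who do plan to tune hyperparameters, here is a minimal sketch of the three-way split described above. It is not the notebook's code: the function name `three_way_split`, the 80/10/10 fractions, and the seed are all assumptions chosen for illustration.

```python
import random


def three_way_split(n_cells, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle cell indices once and cut them into train / validation / test.

    Hypothetical helper: the fractions and seed are illustrative defaults,
    not values taken from the notebook.
    """
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)  # seeded shuffle, reproducible

    n_test = int(n_cells * test_frac)
    n_val = int(n_cells * val_frac)

    test_idx = indices[:n_test]                # touched only once, at the very end
    val_idx = indices[n_test:n_test + n_val]   # used to compare hyperparameters
    train_idx = indices[n_test + n_val:]       # used to fit the model
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a Hugging Face dataset, index lists like these can be passed to `Dataset.select` to build the three subsets. The point is that the test indices stay out of every tuning decision.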
I'm not sure whether my code is right. I fine-tuned the model on the lung cells from the training set and used the test set to test it (I only selected lung cells that are included in the cell type dict), and found that the classification isn't that good. Did you test the model using the test set? Does the test set include the same kinds of lung cells as the training set? Thanks!
Thank you for letting us know. As discussed above, the data used for evaluating our model as well as the alternative approaches was the held-out 20%. We did not optimize hyperparameters or otherwise change the training in response to the results, so these cells remained held out and separate from the training process for all models. The cell_type_test_data.dataset file is from a separate dataset that may or may not have cells annotated in the same way as the first dataset. We did not test our model or others on this dataset.
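Since the two datasets may be annotated differently, one quick sanity check before testing is to compare the label sets. The sketch below is only an illustration: `label_overlap` is a hypothetical helper, and the labels are placeholder strings, not annotations from either dataset.

```python
from collections import Counter


def label_overlap(train_labels, test_labels):
    """Report which cell type labels are shared between two label lists.

    Hypothetical helper; it only compares label strings and knows nothing
    about how either dataset was annotated.
    """
    train_counts = Counter(train_labels)
    test_counts = Counter(test_labels)

    shared = sorted(set(train_counts) & set(test_counts))
    test_only = sorted(set(test_counts) - set(train_counts))
    train_only = sorted(set(train_counts) - set(test_counts))

    # Fraction of test cells whose label the model could have seen in training.
    n_covered = sum(test_counts[label] for label in shared)
    coverage = n_covered / len(test_labels) if test_labels else 0.0

    return {
        "shared": shared,
        "test_only": test_only,
        "train_only": train_only,
        "test_coverage": coverage,
    }


# Placeholder labels for illustration only.
train_labels = ["type_a", "type_a", "type_b", "type_c"]
test_labels = ["type_a", "type_b", "type_b", "type_d"]

report = label_overlap(train_labels, test_labels)
print(report)
```

Any label listed under `test_only` can never be predicted correctly by a classifier trained on the first dataset, which alone can pull accuracy down. Labels that are spelled differently but mean the same cell type would also show up as mismatches here and would need manual harmonization.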
We performed this cell annotation analysis solely as a point of comparison between approaches as it was not a major focus for us. If we were to fine-tune an optimal model to annotate lung cells, we would use multiple datasets for the training to ensure generalizability, optimize hyperparameters to ensure optimal learning, and then evaluate on multiple held out datasets. That would ensure the most effective training and most robust predictive potential. We would recommend taking that approach if you’d like to fine-tune the model for that task.
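One possible way to organize evaluation on held-out datasets is a leave-one-dataset-out loop, sketched below. This is a swapped-in illustration rather than a procedure we ran: the helper name and the dataset names are hypothetical, and the fine-tuning and scoring steps are left out.

```python
def leave_one_dataset_out(dataset_names):
    """Yield (training datasets, held-out dataset) pairs, one per dataset.

    Hypothetical helper: each dataset is held out exactly once while the
    remaining datasets are used together for training.
    """
    for held_out in dataset_names:
        train = [name for name in dataset_names if name != held_out]
        yield train, held_out


# Placeholder dataset names for illustration only.
folds = list(leave_one_dataset_out(["dataset_a", "dataset_b", "dataset_c"]))

for train, held_out in folds:
    # Fine-tune on `train` (with hyperparameters chosen on a validation
    # split drawn from `train` only), then score once on `held_out`.
    print("train on", train, "-> evaluate on", held_out)
```

Holding out whole datasets, rather than random cells, measures how well the model transfers across studies with different protocols and annotation conventions.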