Incorporate new datasets
Hello. Suppose I have two new datasets of in vitro embryoid bodies: one with cell-type annotations (dataset A) and one without (dataset B). The two datasets are also quite different from each other (say, they come from different species, with gene names inferred by orthologous mapping). Is it possible to incorporate their information into your pretrained models through additional self-supervised training (without cell-type annotations)? After that, I would fine-tune on dataset A for cell classification and finally infer the cell types in dataset B.
Would this approach improve the classification of dataset B? Thank you.
Thank you for your interest in Geneformer! It is certainly possible to extend the pretraining with additional datasets to allow the model to learn from their network dynamics. You could compare performance on a validation set to see if it’s a beneficial approach for your specific application or whether directly fine-tuning for cell annotation with the labeled dataset A is sufficient.
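The comparison suggested above could be sketched roughly as follows. This is a minimal illustration, not part of Geneformer: `macro_f1` and `pick_strategy` are hypothetical helper names, macro-averaged F1 is only one reasonable choice of metric, and the labels in the usage section are toy placeholders rather than real results. It assumes you already have predictions on a held-out labeled split of dataset A from each strategy (direct fine-tuning versus extended pretraining followed by fine-tuning).

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("need at least one labeled cell")
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the cell was not p
            fn[t] += 1  # true class t was missed
    scores = []
    for c in sorted(set(y_true)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pick_strategy(y_true, preds_by_strategy):
    """Score each strategy on the same validation labels; return (best, scores)."""
    scores = {name: macro_f1(y_true, preds)
              for name, preds in preds_by_strategy.items()}
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy placeholder labels for a held-out split of dataset A (not real data).
    val_labels = ["ecto", "ecto", "meso", "meso", "endo", "endo"]
    preds = {
        # Hypothetical strategy names; fill in each model's actual predictions.
        "direct_finetune": ["ecto", "ecto", "meso", "endo", "endo", "endo"],
        "extended_pretrain_then_finetune": ["ecto", "meso", "meso", "meso", "endo", "endo"],
    }
    best, scores = pick_strategy(val_labels, preds)
    print(best, scores)
```

Since dataset B has no labels, a validation split of A only measures performance within A's distribution. If B really does differ a lot (for example, a different species), a small hand-annotated subset of B would give a more direct check.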