Unable to Reproduce Results for Gene Classification

#425
by mchatz - opened

Hey,

Using the default settings from the provided notebook (https://huggingface.co/ctheodoris/Geneformer/blob/main/examples/gene_classification.ipynb), I am unable to reproduce the gene classification results for the dosage-sensitive task.
Specifically, I am using the 6-layer Geneformer and the example input data from https://huggingface.co/datasets/ctheodoris/Genecorpus-30M/tree/main/example_input_files/gene_classification/dosage_sensitive_tfs/gc-30M_sample50k.dataset.

Issue:

I am getting a macro F1 score of 0.672, which is lower than expected.

The model is biased toward predicting the second class.
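
For reference, here is a minimal sketch of how macro F1 and a bias toward one class can be read off a confusion matrix. The matrix values are made-up toy numbers, not my actual results, and the helper names `per_class_metrics` and `macro_f1` are hypothetical, not part of Geneformer.

```python
# Toy sketch: per-class precision/recall/F1 and macro F1 from a confusion matrix.
# Rows are true classes, columns are predicted classes.
# The numbers below are invented for illustration only.

def per_class_metrics(confusion):
    """Return a list of (precision, recall, f1) tuples, one per class."""
    n = len(confusion)
    metrics = []
    for k in range(n):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                        # true k, predicted other
        fp = sum(confusion[r][k] for r in range(n)) - tp   # predicted k, true other
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

def macro_f1(confusion):
    """Unweighted mean of the per-class F1 scores."""
    scores = [f1 for _, _, f1 in per_class_metrics(confusion)]
    return sum(scores) / len(scores)

# Hypothetical two-class example with 100 true samples per class.
toy_confusion = [
    [40, 60],  # true class 0: most samples are predicted as class 1
    [10, 90],  # true class 1: mostly predicted correctly
]

# Column sums show how often each class is predicted overall.
predicted_counts = [sum(row[k] for row in toy_confusion) for k in range(2)]

for k, (p, r, f) in enumerate(per_class_metrics(toy_confusion)):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print("predicted counts per class:", predicted_counts)
print(f"macro F1: {macro_f1(toy_confusion):.3f}")
```

A low recall for the first class together with a large predicted count for the second class is the pattern I mean by "biased toward the second class".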

Please let me know if you have any suggestions or if additional configuration is required.

Thank you in advance.

Thank you for your question! If you are using the current version, please note that the default token dictionary is for the 95M model, so you need to provide the 30M dictionary when using the 30M model. Otherwise, the tokens will be scrambled from their true gene identity.
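
As a toy illustration of why a mismatched dictionary scrambles gene identity, consider the sketch below. The gene names and both dictionaries are invented for this example and are not the real Geneformer token dictionary files. The functions `encode` and `decode` are hypothetical helpers, not Geneformer API calls.

```python
# Toy sketch: the same token IDs mean different genes under different dictionaries.
# Both dictionaries are invented; they only mimic the idea of two vocabularies
# that assign different integer IDs to the same genes.

dict_a = {"<pad>": 0, "<mask>": 1, "GENE_A": 2, "GENE_B": 3, "GENE_C": 4}
dict_b = {"<pad>": 0, "<mask>": 1, "GENE_C": 2, "GENE_A": 3, "GENE_B": 4}

def encode(genes, token_dict):
    """Map gene names to integer token IDs."""
    return [token_dict[g] for g in genes]

def decode(tokens, token_dict):
    """Map integer token IDs back to gene names."""
    inverse = {token_id: gene for gene, token_id in token_dict.items()}
    return [inverse.get(t, "<unknown>") for t in tokens]

# A cell represented as a ranked list of genes, tokenized with dictionary A.
cell = ["GENE_C", "GENE_A", "GENE_B"]
tokens = encode(cell, dict_a)

# Interpreting the tokens with the matching dictionary recovers the genes.
matched = decode(tokens, dict_a)

# Interpreting the same tokens with the other dictionary yields different genes.
mismatched = decode(tokens, dict_b)

print("tokens:    ", tokens)
print("matched:   ", matched)
print("mismatched:", mismatched)
```

In this toy case every gene label ends up attached to the wrong token. A model fine-tuned or evaluated under such a mismatch is effectively working with shuffled gene identities.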

ctheodoris changed discussion status to closed
