Errors from the new version of the geneformer (classifier.py)

#394
by bsb0613 - opened

Thanks for the great package and maintenance!
I have a few questions regarding the new version of Geneformer.

I recently downloaded and installed the most recent version of Geneformer.
However, I ran into an issue when trying to run the "Cell classification" example. I used all the default example files and scripts.

When I called cc.validate on the Classifier (cc), the following error occurred:

Traceback (most recent call last):
File "", line 1, in
File "/home/bsb0613/Geneformer/geneformer/classifier.py", line 785, in validate
trainer = self.train_classifier(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer/geneformer/classifier.py", line 1245, in train_classifier
token_dictionary=self.token_dictionary
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Classifier' object has no attribute 'token_dictionary'. Did you mean: 'token_dictionary_file'?

I think the problem may be in the new version of "DataCollatorForCellClassification", which appears to add an option to accept a custom token dictionary instead of the default one ("TOKEN_DICTIONARY_FILE").

It looks like some functions in "collator_for_classification.py" and "classifier.py" were changed to accept the custom dictionary, and in the process a variable name or a file-loading step may have been missed. For example, the Classifier class has no 'token_dictionary' attribute, only 'token_dictionary_file', which is a file path that is supposed to be loaded to build 'gene_token_dict'.
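To illustrate the mismatch, here is a minimal sketch, not Geneformer's actual code. `MiniClassifier` and `build_collator_kwargs` are hypothetical names, and the dictionary contents are made up. The idea is only that the path stored in `token_dictionary_file` has to be loaded into a `token_dictionary` attribute before anything reads that attribute.

```python
import pickle
import tempfile
from pathlib import Path


class MiniClassifier:
    """Hypothetical stand-in for a classifier that takes a token dictionary path."""

    def __init__(self, token_dictionary_file):
        # Keep the path; the error message hints that this attribute exists.
        self.token_dictionary_file = token_dictionary_file
        # Load the mapping once so later code can read self.token_dictionary.
        with open(token_dictionary_file, "rb") as f:
            self.token_dictionary = pickle.load(f)

    def build_collator_kwargs(self):
        # Without the load in __init__, this attribute access would raise
        # an AttributeError like the one in the traceback.
        return {"token_dictionary": self.token_dictionary}


# Usage with a small made-up dictionary written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "token_dictionary.pkl"
    with open(path, "wb") as f:
        pickle.dump({"<pad>": 0, "<mask>": 1, "GENE_A": 2}, f)
    clf = MiniClassifier(path)
    kwargs = clf.build_collator_kwargs()
```

Whether the real class should load the file in its constructor or somewhere else is a design choice for the maintainers; the sketch only shows why the attribute has to exist before it is passed on.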

I tried changing variable names and loading token_dictionary at the beginning as in previous versions, but doing so sometimes produced other errors in other parts of the scripts.

Could you please check whether the new updates introduced any issues?

Thank you!

Thank you for pointing this out! Pushed a change that should resolve this.

ctheodoris changed discussion status to closed

Thanks for the quick reply! I just re-downloaded all the files from Hugging Face and re-created the environment for Geneformer.

However, another error (the same one I hit when I tried to fix classifier.py myself) occurred in the same function (cc.validate), with the following message:

Traceback (most recent call last):
File "", line 1, in
File "/home/bsb0613/Geneformer_20240826/geneformer/classifier.py", line 785, in validate
trainer = self.train_classifier(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/classifier.py", line 1244, in train_classifier
data_collator = DataCollatorForCellClassification(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/collator_for_classification.py", line 614, in __init__
tokenizer=PrecollatorForGeneAndCellClassification(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/collator_for_classification.py", line 88, in __init__
self.mask_token_id = self.token_dictionary.get("

Do you think it could be a problem with the version of transformers that I installed?

Thank you again.

Thank you for following up. It seems your error message was cut off - could you add the rest? Also could you let us know the arguments you are using to call the function? Thank you!

Sorry for cutting off the error message.

Here is the full error:

Traceback (most recent call last):
File "", line 1, in
File "/home/bsb0613/Geneformer_20240826/geneformer/classifier.py", line 785, in validate
trainer = self.train_classifier(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/classifier.py", line 1244, in train_classifier
data_collator = DataCollatorForCellClassification(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/collator_for_classification.py", line 614, in __init__
tokenizer=PrecollatorForGeneAndCellClassification(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/Geneformer_20240826/geneformer/collator_for_classification.py", line 88, in __init__
self.mask_token_id = self.token_dictionary.get("mask") # "mask" is supposed to have "<>" around it but using that actually masked the comment so I converted only for here
^^^^^^^^^^^^^^^^^^
File "/home/bsb0613/miniconda3/envs/geneformer4/lib/python3.12/site-packages/transformers/tokenization_utils_base.py", line 1290, in mask_token_id
self._mask_token = self.convert_ids_to_tokens(value) if value is not None else None
^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'PrecollatorForGeneAndCellClassification' object has no attribute 'convert_ids_to_tokens'
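The last two frames suggest the mechanism: assigning `mask_token_id` runs an inherited property setter, and that setter calls `convert_ids_to_tokens` on the object. The sketch below reproduces that pattern with stand-in classes and no dependency on transformers. `SpecialTokensBase`, `BrokenPrecollator`, and `PatchedPrecollator` are hypothetical names, and the patch shown is one possible workaround under these assumptions, not necessarily the fix that was pushed.

```python
class SpecialTokensBase:
    """Stand-in for a tokenizer base class whose setter converts ids to tokens."""

    @property
    def mask_token_id(self):
        return self._mask_token_id

    @mask_token_id.setter
    def mask_token_id(self, value):
        # Mirrors the line quoted in the traceback: the setter relies on a
        # method that the subclass is expected to provide.
        self._mask_token = (
            self.convert_ids_to_tokens(value) if value is not None else None
        )
        self._mask_token_id = value


class BrokenPrecollator(SpecialTokensBase):
    def __init__(self, token_dictionary):
        self.token_dictionary = token_dictionary
        # Triggers the inherited setter, which needs convert_ids_to_tokens.
        self.mask_token_id = self.token_dictionary.get("<mask>")


class PatchedPrecollator(SpecialTokensBase):
    def __init__(self, token_dictionary):
        self.token_dictionary = token_dictionary
        # Reverse lookup must exist before the setter runs.
        self._id_to_token = {v: k for k, v in token_dictionary.items()}
        self.mask_token_id = self.token_dictionary.get("<mask>")

    def convert_ids_to_tokens(self, token_id):
        # Supplies the method the inherited setter calls.
        return self._id_to_token[token_id]


tokens = {"<pad>": 0, "<mask>": 1}  # made-up dictionary for illustration

try:
    BrokenPrecollator(tokens)
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

patched = PatchedPrecollator(tokens)
print(error_message)
print(patched.mask_token_id)
```

If this is what is happening, the error depends on whether the installed base class defines such a setter, which would explain why it only appears with some library versions.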

For the arguments and inputs, I followed the code in "examples/cell_classification.ipynb", with "human_dcm_hcm_nf.dataset" from GeneCorpus-30M as the input data file and "gf-6L-30M-i2048" from Geneformer as the model.

The exact call that caused the error was:

all_metrics = cc.validate(
    model_directory="/home/bsb0613/Geneformer/gf-6L-30M-i2048/",
    prepared_input_data_file=f"{output_dir}/{output_prefix}_labeled_train.dataset",
    id_class_dict_file=f"{output_dir}/{output_prefix}_id_class_dict.pkl",
    output_directory=output_dir,
    output_prefix=output_prefix,
    split_id_dict=train_valid_id_split_dict,
)

Thank you - that is helpful. Could you tell us the version of transformers you have installed?

I am currently using transformers 4.44.2. All the packages were installed by cloning the Geneformer repository with git and installing from the clone, except for tensorboard, which I had to install separately after installing Geneformer.

I am having all the same issues as bsb0613.

For what it's worth, I tried uninstalling transformers with pip and reinstalling it with the command
pip install "transformers>4.28,<4.44"
which yielded transformers version 4.43.4. Unfortunately, the errors persist, i.e. "has no attribute 'convert_ids_to_tokens'".

So my hunch is that this issue may not be limited to the version of transformers.
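For anyone comparing environments, the snippet below is a rough way to confirm which version is actually installed and whether it falls inside the range from the pip command above. The helper names are made up for this example, and the range check is a simplification that ignores pre-release ordering rules.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def numeric_parts(version):
    """Parse leading numeric components, e.g. '4.43.4' -> (4, 43, 4)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or 'rc1'
        parts.append(int(piece))
    # Drop trailing zeros so that '4.28.0' compares equal to '4.28'.
    while parts and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def in_pinned_range(version, lower=(4, 28), upper=(4, 44)):
    """Rough check of '>4.28,<4.44' using tuple comparison."""
    return lower < numeric_parts(version) < upper


print("transformers:", installed_version("transformers"))
```

Under this check, 4.43.4 is inside the range and 4.44.2 is outside it, which matches the two versions reported in this thread.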

I also tried to fix it by changing the transformers version, but the issue persisted. Do you have any idea how to fix it? Please count my comment as another confirmation of the issue.

Thanks for pointing out the errors; we pushed changes to resolve these issues. Please pull the most recent version and let us know!

Updated, and it appears to work now. Thank you!
