gene_classification.ipynb: error with no_cuda = True
Hello
Because my graphics card has limited RAM, I tried to test whether the code could run on the CPU, at the expense of time, with the following changes:
GPU_NUMBER = [0]
os.environ["CUDA_VISIBLE_DEVICES"] = ""   # hide all GPUs so only the CPU is visible
os.environ["NCCL_DEBUG"] = "INFO"

# in classifier_predict, move outputs back to the CPU
predict_logits += [torch.squeeze(outputs.logits.to("cpu"))]
predict_labels += [torch.squeeze(label_batch.to("cpu"))]

model = model.to("cpu")
training_args = {
"learning_rate": max_lr,
"do_train": True,
"evaluation_strategy": "no",
"save_strategy": "epoch",
"logging_steps": 100,
"group_by_length": True,
"length_column_name": "length",
"disable_tqdm": False,
"lr_scheduler_type": lr_schedule_fn,
"warmup_steps": warmup_steps,
"fp16": False,
"bf16": True,
"fp16_full_eval": False,
"no_cuda": True,
"half_precision_backend": "cpu_amp",
"weight_decay": 0.001,
"per_device_train_batch_size": geneformer_batch_size,
"per_device_eval_batch_size": geneformer_batch_size,
"num_train_epochs": epochs,
}
However, I got the following error:
File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/transformers/training_args.py:1340, in TrainingArguments.__post_init__(self)
   1334 if version.parse(version.parse(torch.__version__).base_version) == version.parse("2.0.0") and self.fp16:
   1335     raise ValueError("--optim adamw_torch_fused with --fp16 requires PyTorch>2.0")
   1337 if (
   1338     self.framework == "pt"
   1339     and is_torch_available()
-> 1340     and (self.device.type != "cuda")
   1341     and (get_xla_device_type(self.device) != "GPU")
   1342     and (self.fp16 or self.fp16_full_eval)
   1343 ):
   1344     raise ValueError(
   1345         "FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation"
   1346         " (`--fp16_full_eval`) can only be used on CUDA devices."
   1347     )
   1349 if (
   1350     self.framework == "pt"
   1351     and is_torch_available()
   (...)
   1356     and (self.bf16 or self.bf16_full_eval)
   1357 ):

File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/transformers/training_args.py:1764, in TrainingArguments.device(self)
   1760 """
   1761 The device used by this process.
   1762 """
   1763 requires_backends(self, ["torch"])
-> 1764 return self._setup_devices

File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/transformers/utils/generic.py:54, in cached_property.__get__(self, obj, objtype)
    52 cached = getattr(obj, attr, None)
    53 if cached is None:
---> 54     cached = self.fget(obj)
    55 setattr(obj, attr, cached)
    56 return cached

File /home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/transformers/training_args.py:1672, in TrainingArguments._setup_devices(self)
   1670 if not is_sagemaker_mp_enabled():
   1671     if not is_accelerate_available(min_version="0.20.1"):
-> 1672         raise ImportError(
   1673             "Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`"
   1674         )
   1675 AcceleratorState._reset_state(reset_partial_state=True)
   1676 self.distributed_state = None

ImportError: Using the `Trainer` with `PyTorch` requires `accelerate>=0.20.1`: Please run `pip install transformers[torch]` or `pip install accelerate -U`
I have tried installing accelerate==0.20.1, but it did not resolve the problem. How should I resolve the issue? Thank you.
Thank you for your interest in Geneformer. We have not tried fine-tuning the model on CPUs, but based on your error message, it could be because bf16 is not available for training on CPUs. However, this error is coming from Hugging Face Transformers, so we would suggest you check their documentation to learn more. If you do find a solution, please feel free to post it here to help others who may have a similar question in the future.
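For example, one untested option would be to turn off all of the mixed precision flags when forcing the CPU, so that the fp16/bf16 device checks in Transformers are never triggered (a sketch only; the other keys stay as in your dict above):

# Untested sketch: same training_args dict as above, but with every mixed
# precision flag disabled so plain fp32 CPU training is requested.
training_args = {
    # ... other keys unchanged from the notebook ...
    "fp16": False,
    "bf16": False,             # bf16 may not be supported for CPU training here
    "fp16_full_eval": False,
    "no_cuda": True,           # keep everything on the CPU
    # consider also dropping "half_precision_backend" since no AMP is used
}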
Thank you for your reply.
I found it worked after commenting out the following block in training_args.py in the geneformer environment. Probably the `(self.device.type != "cuda")` check still triggered even though I had set `"no_cuda": True` in the training arguments.
# if (
# self.framework == "pt"
# and is_torch_available()
# and (self.device.type != "cuda")
# and (get_xla_device_type(self.device) != "GPU")
# and (self.fp16 or self.fp16_full_eval)
# ):
# raise ValueError(
# "FP16 Mixed precision training with AMP or APEX (`--fp16`) and FP16 half precision evaluation"
# " (`--fp16_full_eval`) can only be used on CUDA devices."
# )
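As a sanity check (not part of the notebook), something like the following can confirm which device the Trainer will resolve to:

# Hypothetical quick check, not from the original notebook: confirm the Trainer
# falls back to the CPU when no_cuda is set and mixed precision is disabled.
from transformers import TrainingArguments

check_args = TrainingArguments(
    output_dir="cpu_device_check",  # throwaway output directory
    no_cuda=True,
    fp16=False,
    bf16=False,
)
print(check_args.device)  # expected: device(type='cpu')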
After a while, another error, a tensor size mismatch, occurred as shown below with panglao_SRA553822-SRS2119548.dataset. Is it associated with the use of the CPU as well? Thank you.
In [20]: # cross-validate gene classifier
    ...: all_roc_auc, roc_auc, roc_auc_sd, mean_fpr, mean_tpr, confusion, label_dicts \
    ...:     = cross_validate(subsampled_train_dataset, targets, labels, nsplits, subsample_size,
    ...:                      training_args, freeze_layers, training_output_dir, 1)
0it [00:00, ?it/s]
****** Crossval split: 0/4 ******
Filtering training data
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-2fbdb7063800f6bd.arrow
Filtered 50%; 14897 remain
Filtering evalation data
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-53fef5c1bf2aa95d.arrow
Filtered 74%; 7860 remain
Labeling training data
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-04592d8012bd56bc.arrow
Labeling evaluation data
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-553f2b895bb22c3b.arrow
Labeling evaluation OOS data
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-5342eb45c1ad0501.arrow
Some weights of the model checkpoint at ctheodoris/Geneformer were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at ctheodoris/Geneformer and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
/home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/transformers/optimization.py:411: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/pc/miniconda3/envs/geneformer/lib/python3.10/site-packages/geneformer/collator_for_classification.py:581: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}
{'loss': 0.7035, 'learning_rate': 1e-05, 'epoch': 0.12}
{'loss': 0.6153, 'learning_rate': 2e-05, 'epoch': 0.24}
{'loss': 0.5162, 'learning_rate': 3e-05, 'epoch': 0.36}
{'loss': 0.3814, 'learning_rate': 4e-05, 'epoch': 0.48}
{'loss': 0.287, 'learning_rate': 5e-05, 'epoch': 0.6}
{'loss': 0.2037, 'learning_rate': 3.502994011976048e-05, 'epoch': 0.72}
{'loss': 0.1862, 'learning_rate': 2.0059880239520957e-05, 'epoch': 0.84}
{'loss': 0.1583, 'learning_rate': 5.0898203592814375e-06, 'epoch': 0.96}
{'train_runtime': 240.3808, 'train_samples_per_second': 41.601, 'train_steps_per_second': 3.469, 'train_loss': 0.37374645285755037, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████| 834/834 [04:00<00:00, 3.47it/s]
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-b509cef10cae8da3.arrow
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[20], line 3
1 # cross-validate gene classifier
2 all_roc_auc, roc_auc, roc_auc_sd, mean_fpr, mean_tpr, confusion, label_dicts
----> 3 = cross_validate(subsampled_train_dataset, targets, labels, nsplits, subsample_size, training_args, freeze_layers, training_output_dir, 1)
Cell In[14], line 128, in cross_validate(data, targets, labels, nsplits, subsample_size, training_args, freeze_layers, output_dir, num_proc)
125 trainer.save_model(ksplit_model_dir)
127 # evaluate model
--> 128 fpr, tpr, interp_tpr, conf_mat = classifier_predict(trainer.model, evalset_oos_labeled, 200, mean_fpr)
130 # append to tpr and roc lists
131 confusion = confusion + conf_mat
Cell In[13], line 38, in classifier_predict(model, evalset, forward_batch_size, mean_fpr)
35 predict_logits += [torch.squeeze(outputs.logits.to("cpu"))]
36 predict_labels += [torch.squeeze(label_batch.to("cpu"))]
---> 38 logits_by_cell = torch.cat(predict_logits)
39 all_logits = logits_by_cell.reshape(-1, logits_by_cell.shape[2])
40 labels_by_cell = torch.cat(predict_labels)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1062 but got size 1268 for tensor number 1 in the list.
Thank you for following up. This appears similar to closed issue #31. Have you pulled the updated version? If not, please try that.
Thank you for the suggestion.
I tried the updated version, but a similar error occurred below:
mnt/c/Users/pc/Downloads/geneformer/geneformer/collator_for_classification.py:581: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
batch = {k: torch.tensor(v, dtype=torch.int64) for k, v in batch.items()}
{'loss': 0.6515, 'learning_rate': 1e-05, 'epoch': 0.12}
{'loss': 0.5868, 'learning_rate': 2e-05, 'epoch': 0.24}
{'loss': 0.4944, 'learning_rate': 3e-05, 'epoch': 0.36}
{'loss': 0.3692, 'learning_rate': 4e-05, 'epoch': 0.48}
{'loss': 0.2849, 'learning_rate': 5e-05, 'epoch': 0.6}
{'loss': 0.2078, 'learning_rate': 3.502994011976048e-05, 'epoch': 0.72}
{'loss': 0.1842, 'learning_rate': 2.0059880239520957e-05, 'epoch': 0.84}
{'loss': 0.1596, 'learning_rate': 5.0898203592814375e-06, 'epoch': 0.96}
{'train_runtime': 257.4132, 'train_samples_per_second': 38.848, 'train_steps_per_second': 3.24, 'train_loss': 0.3598546204235342, 'epoch': 1.0}
100%|█████████████████████████████████████████████████████████████████████████████████| 834/834 [04:17<00:00, 3.24it/s]
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-b509cef10cae8da3.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-c02fb696b3d5e061.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-d74bd43d854bd585.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-ef0a8d586fb36dfe.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-2299ba01f3f3324a.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-ec26febb779ef58e.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-d9ba779b30352d3b.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-1cae902e9833d530.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-eef97f1e3ce741ff.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-e0aa69b5c969bbb0.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-c79eff583fd31baa.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-d98e737501b6e111.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-a07fe905fe45fec6.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-4830ad31848f15f1.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-f30f4f93e63a369f.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-45377aad03d16995.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-2ecda395f441d64e.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-829ae39672d6280d.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-bfe0c60751641870.arrow
Loading cached processed dataset at /mnt/c/Users/pc/Downloads/GS_example/cache-13e2b37bb38c0512.arrow
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:3                                                                                     │
│ in cross_validate:128                                                                             │
│ in classifier_predict:38                                                                          │
╰───────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 1062 but got size 1268 for tensor number 1 in the list.
Thank you for following up. Could you please run a diff between your notebook and the current one in this repository to confirm it is up to date?
For example, in your notebook, is your variable padded_batch defined as follows?
padded_batch = preprocess_classifier_batch(batch_evalset, max_evalset_len)
If it is up to date, please provide details of anything you changed in the notebook so I can try to reproduce the error, because I am not encountering it when I run the current notebook.
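For context on why that line matters: each evaluation batch can end up with a different maximum sequence length, so the per-batch tensors have to be padded to a shared length before torch.cat can stack them; otherwise it fails exactly as in your traceback. Below is a simplified illustration of the padding idea only, not the notebook's actual preprocess_classifier_batch function:

import torch
import torch.nn.functional as F

# Simplified illustration only: pad each per-batch logits tensor along the
# sequence dimension to a shared maximum length so torch.cat can combine them.
def pad_to_length(logits_batches, max_len):
    padded = []
    for logits in logits_batches:  # each tensor: (batch, seq_len, num_classes)
        pad_amount = max_len - logits.shape[1]
        # F.pad pads the last dims first: (0, 0) leaves num_classes alone,
        # (0, pad_amount) extends seq_len up to max_len
        padded.append(F.pad(logits, (0, 0, 0, pad_amount)))
    return torch.cat(padded)

# Mirroring the sizes in the error message: 1062 vs 1268
batches = [torch.zeros(200, 1062, 2), torch.zeros(200, 1268, 2)]
print(pad_to_length(batches, 1268).shape)  # torch.Size([400, 1268, 2])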
Thank you for your reminder. You are right. I had just updated the package but forgot to update the notebook as well. Now it works.