Jun 30, 2023

Thank you for providing Geneformer. I tried hyperparameter tuning with the input below:

# set training arguments
training_args = {
    "do_train": True,
    "do_eval": True,
    "evaluation_strategy": "steps",
    "eval_steps": logging_steps,
    "logging_steps": logging_steps,
    "group_by_length": True,
    "save_steps": 7248,
    "length_column_name": "length",
    "disable_tqdm": True,
    "skip_memory_metrics": True,  # memory tracker causes errors in raytune
    "per_device_train_batch_size": geneformer_batch_size,
    "per_device_eval_batch_size": geneformer_batch_size,
    "num_train_epochs": epochs,
    "load_best_model_at_end": True,  # original true
    "output_dir": output_dir,
}

training_args_init = TrainingArguments(**training_args)

# create the trainer
trainer = Trainer(
    model_init=model_init,
    args=training_args_init,
    data_collator=DataCollatorForCellClassification(),
    train_dataset=classifier_trainset,
    eval_dataset=classifier_validset,
    compute_metrics=compute_metrics,
)

# specify raytune hyperparameter search space
ray_config = {
    "num_train_epochs": tune.choice([epochs]),
    "learning_rate": tune.loguniform(1e-6, 1e-3),
    "weight_decay": tune.uniform(0.0, 0.3),
    "lr_scheduler_type": tune.choice(["linear", "cosine", "polynomial"]),
    "warmup_steps": tune.uniform(100, 2000),
    "seed": tune.uniform(0, 100),
    "per_device_train_batch_size": tune.choice([geneformer_batch_size]),
}

hyperopt_search = HyperOptSearch(metric="eval_accuracy", mode="max")

# optimize hyperparameters
trainer.hyperparameter_search(
    direction="maximize",
    backend="ray",
    resources_per_trial={"cpu": 36, "gpu": 1},
    hp_space=lambda _: ray_config,
    search_alg=hyperopt_search,
    n_trials=100,  # number of trials
    progress_reporter=tune.CLIReporter(
        max_report_frequency=600,
        sort_by_metric=True,
        max_progress_rows=100,
        mode="max",
        metric="eval_accuracy",
        metric_columns=["loss", "eval_loss", "eval_accuracy"],
    ),
)

The following error occurred. Could you kindly help me locate the problem?

== Status ==
Current time: 2023-06-30 13:08:26 (running for 00:40:00.32)
Using FIFO scheduling algorithm.
Logical resource usage: 36.0/48 CPUs, 1.0/1 GPUs
Result logdir: /root/ray_results/_objective_2023-06-30_12-28-26
Number of trials: 2/100 (1 PENDING, 1 RUNNING)
+---------------------+----------+-----------------------+-----------------+---------------------+--------------------+------------------------+---------+----------------+----------------+
| Trial name | status | loc | learning_rate | lr_scheduler_type | num_train_epochs | per_device_train_bat | seed | warmup_steps | weight_decay |
| | | | | | | ch_size | | | |
|---------------------+----------+-----------------------+-----------------+---------------------+--------------------+------------------------+---------+----------------+----------------|
| _objective_c3280932 | RUNNING | 172.31.110.212:728665 | 6.60963e-06 | polynomial | 1 | 2 | 68.3648 | 1617.92 | 0.150326 |
| _objective_bd8a2bfc | PENDING | | 0.000120052 | linear | 1 | 2 | 17.703 | 1592.29 | 0.2757 |
+---------------------+----------+-----------------------+-----------------+---------------------+--------------------+------------------------+---------+----------------+----------------+

(_objective pid=728665) {'loss': 0.9496, 'learning_rate': 6.0923637356386e-06, 'epoch': 0.1}
(_objective pid=728665) {'eval_runtime': 0.001, 'eval_samples_per_second': 0.0, 'eval_steps_per_second': 0.0, 'epoch': 0.1}
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py:815 in │
│ _on_result │
│ │
│ 812 │ │ │ │ │ f"{args}, {kwargs}" │
│ 813 │ │ │ │ ) │
│ 814 │ │ │ │ try: │
│ ❱ 815 │ │ │ │ │ on_result(trial, *args, **kwargs) │
│ 816 │ │ │ │ except Exception as e: │
│ 817 │ │ │ │ │ logger.debug( │
│ 818 │ │ │ │ │ │ f"Error handling {method_name.upper()} result " │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:735 in │
│ _on_training_result │
│ │
│ 732 │ │ if not isinstance(result, list): │
│ 733 │ │ │ result = [result] │
│ 734 │ │ with warn_if_slow("process_trial_result"): │
│ ❱ 735 │ │ │ self._process_trial_results(trial, result) │
│ 736 │ │ self._maybe_execute_queued_decision(trial) │
│ 737 │ │
│ 738 │ def _process_trial_results(self, trial, results): │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:748 in │
│ _process_trial_results │
│ │
│ 745 │ │ ): │
│ 746 │ │ │ for i, result in enumerate(results): │
│ 747 │ │ │ │ with warn_if_slow("process_trial_result"): │
│ ❱ 748 │ │ │ │ │ decision = self._process_trial_result(trial, result) │
│ 749 │ │ │ │ if decision is None: │
│ 750 │ │ │ │ │ # If we didn't get a decision, this means a │
│ 751 │ │ │ │ │ # non-training future (e.g. a save) was scheduled. │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:785 in │
│ _process_trial_result │
│ │
│ 782 │ │ self._total_time += result.get(TIME_THIS_ITER_S, 0) │
│ 783 │ │ │
│ 784 │ │ flat_result = flatten_dict(result) │
│ ❱ 785 │ │ self._validate_result_metrics(flat_result) │
│ 786 │ │ │
│ 787 │ │ if self._stopper(trial.trial_id, result) or trial.should_stop(flat_result): │
│ 788 │ │ │ decision = TrialScheduler.STOP │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py:883 in │
│ _validate_result_metrics │
│ │
│ 880 │ │ │ │ location = None │
│ 881 │ │ │ │
│ 882 │ │ │ if report_metric: │
│ ❱ 883 │ │ │ │ raise ValueError( │
│ 884 │ │ │ │ │ "Trial returned a result which did not include the " │
│ 885 │ │ │ │ │ "specified metric(s) {} that {} expects. " │
│ 886 │ │ │ │ │ "Make sure your calls to tune.report() include the " │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: Trial returned a result which did not include the specified metric(s) eval_accuracy that SearchGenerator
expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING
environment variable to 1. Result: {'objective': None, 'eval_runtime': 0.001, 'eval_samples_per_second': 0.0,
'eval_steps_per_second': 0.0, 'epoch': 0.1, 'time_this_iter_s': 2627.347311973572, 'done': False, 'training_iteration':
1, 'trial_id': 'c3280932', 'date': '2023-06-30_13-12-17', 'timestamp': 1688101937, 'time_total_s': 2627.347311973572,
'pid': 728665, 'hostname': 'DESKTOP-6FHRRIO', 'node_ip': '172.31.110.212', 'time_since_restore': 2627.347311973572,
'iterations_since_restore': 1, 'config/num_train_epochs': 1, 'config/learning_rate': 6.609632605618226e-06,
'config/weight_decay': 0.15032619764660335, 'config/lr_scheduler_type': 'polynomial', 'config/warmup_steps':
1617.919440294964, 'config/seed': 68.36480039671002, 'config/per_device_train_batch_size': 2}

During handling of the above exception, another exception occurred:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:2 │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/transformers/trainer.py:2628 in │
│ hyperparameter_search │
│ │
│ 2625 │ │ │ HPSearchBackend.SIGOPT: run_hp_search_sigopt, │
│ 2626 │ │ │ HPSearchBackend.WANDB: run_hp_search_wandb, │
│ 2627 │ │ } │
│ ❱ 2628 │ │ best_run = backend_dict[backend](self, n_trials, direction, **kwargs) │
│ 2629 │ │ │
│ 2630 │ │ self.hp_search_backend = None │
│ 2631 │ │ return best_run │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/transformers/integrations.py:353 in │
│ run_hp_search_ray │
│ │
│ 350 │ if hasattr(trainable, "mixins"): │
│ 351 │ │ dynamic_modules_import_trainable.mixins = trainable.mixins │
│ 352 │ │
│ ❱ 353 │ analysis = ray.tune.run( │
│ 354 │ │ dynamic_modules_import_trainable, │
│ 355 │ │ config=trainer.hp_space(None), │
│ 356 │ │ num_samples=n_trials, │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/tune.py:1070 in run │
│ │
│ 1067 │ │ │ while ( │
│ 1068 │ │ │ │ not runner.is_finished() and not experiment_interrupted_event.is_set() │
│ 1069 │ │ │ ): │
│ ❱ 1070 │ │ │ │ runner.step() │
│ 1071 │ │ │ │ if has_verbosity(Verbosity.V1_EXPERIMENT): │
│ 1072 │ │ │ │ │ _report_progress(runner, progress_reporter) │
│ 1073 │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py:256 in │
│ step │
│ │
│ 253 │ │ self._maybe_add_actors() │
│ 254 │ │ │
│ 255 │ │ # Handle one event │
│ ❱ 256 │ │ if not self._actor_manager.next(timeout=0.1): │
│ 257 │ │ │ # If there are no actors running, warn about potentially │
│ 258 │ │ │ # insufficient resources │
│ 259 │ │ │ if not self._actor_manager.num_live_actors: │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/air/execution/_internal/actor_manager.py:22 │
│ 4 in next │
│ │
│ 221 │ │ if future in actor_state_futures: │
│ 222 │ │ │ self._actor_state_events.resolve_future(future) │
│ 223 │ │ elif future in actor_task_futures: │
│ ❱ 224 │ │ │ self._actor_task_events.resolve_future(future) │
│ 225 │ │ else: │
│ 226 │ │ │ self._handle_ready_resource_future() │
│ 227 │ │ │ # Ready resource futures don't count as one event as they don't trigger │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/air/execution/_internal/event_manager.py:11 │
│ 8 in resolve_future │
│ │
│ 115 │ │ │ │ raise e │
│ 116 │ │ else: │
│ 117 │ │ │ if on_result: │
│ ❱ 118 │ │ │ │ on_result(result) │
│ 119 │ │
│ 120 │ def wait( │
│ 121 │ │ self, │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/air/execution/_internal/actor_manager.py:75 │
│ 2 in on_result │
│ │
│ 749 │ │ │ ) from e │
│ 750 │ │ │
│ 751 │ │ def on_result(result: Any): │
│ ❱ 752 │ │ │ self._actor_task_resolved( │
│ 753 │ │ │ │ tracked_actor_task=tracked_actor_task, result=result │
│ 754 │ │ │ ) │
│ 755 │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/air/execution/_internal/actor_manager.py:30 │
│ 0 in _actor_task_resolved │
│ │
│ 297 │ │ │
│ 298 │ │ # Trigger actor task result callback │
│ 299 │ │ if tracked_actor_task._on_result: │
│ ❱ 300 │ │ │ tracked_actor_task._on_result(tracked_actor, result) │
│ 301 │ │
│ 302 │ def _handle_ready_resource_future(self): │
│ 303 │ │ """Handle a resource future that became ready. │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py:824 in │
│ _on_result │
│ │
│ 821 │ │ │ │ │ if e is TuneError or self._fail_fast == self.RAISE: │
│ 822 │ │ │ │ │ │ raise e │
│ 823 │ │ │ │ │ else: │
│ ❱ 824 │ │ │ │ │ │ raise TuneError(traceback.format_exc()) │
│ 825 │ │ │
│ 826 │ │ if on_error: │
│ 827 │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TuneError: Traceback (most recent call last):
File "/home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/tune_controller.py", line 815, in _on_result
on_result(trial, *args, **kwargs)
File "/home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 735, in
_on_training_result
self._process_trial_results(trial, result)
File "/home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 748, in
_process_trial_results
decision = self._process_trial_result(trial, result)
File "/home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 785, in
_process_trial_result
self._validate_result_metrics(flat_result)
File "/home/pc/miniconda3/lib/python3.10/site-packages/ray/tune/execution/trial_runner.py", line 883, in
_validate_result_metrics
raise ValueError(
ValueError: Trial returned a result which did not include the specified metric(s) eval_accuracy that SearchGenerator
expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING
environment variable to 1. Result: {'objective': None, 'eval_runtime': 0.001, 'eval_samples_per_second': 0.0,
'eval_steps_per_second': 0.0, 'epoch': 0.1, 'time_this_iter_s': 2627.347311973572, 'done': False, 'training_iteration':
1, 'trial_id': 'c3280932', 'date': '2023-06-30_13-12-17', 'timestamp': 1688101937, 'time_total_s': 2627.347311973572,
'pid': 728665, 'hostname': 'DESKTOP-6FHRRIO', 'node_ip': '172.31.110.212', 'time_since_restore': 2627.347311973572,
'iterations_since_restore': 1, 'config/num_train_epochs': 1, 'config/learning_rate': 6.609632605618226e-06,
'config/weight_decay': 0.15032619764660335, 'config/lr_scheduler_type': 'polynomial', 'config/warmup_steps':
1617.919440294964, 'config/seed': 68.36480039671002, 'config/per_device_train_batch_size': 2}

pchiang5

Jun 30, 2023

BTW, there was another error below when I ran the native code in the notebook:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:20 │
│ in __init__:111 │
│ │
│ /home/pc/miniconda3/lib/python3.10/site-packages/transformers/training_args.py:1251 in │
│ __post_init__ │
│ │
│ 1248 │ │ │ │ │ │ │ "--load_best_model_at_end requires the saving steps to be a │
│ 1249 │ │ │ │ │ │ │ f"steps, but found {self.save_steps}, which is not a multipl │
│ 1250 │ │ │ │ │ │ ) │
│ ❱ 1251 │ │ │ │ raise ValueError( │
│ 1252 │ │ │ │ │ "--load_best_model_at_end requires the saving steps to be a round mu │
│ 1253 │ │ │ │ │ f"steps, but found {self.save_steps}, which is not a round multiple │
│ 1254 │ │ │ │ ) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
ValueError: --load_best_model_at_end requires the saving steps to be a round multiple of the evaluation steps, but found
453, which is not a round multiple of 7248.

The error could be bypassed by adding "save_steps": 7248 to "training_args". The original error (ValueError: Trial returned a result which did not include the specified metric(s) eval_accuracy that SearchGenerator expects. Make sure your calls to tune.report() include the metric, or set the TUNE_DISABLE_STRICT_METRIC_CHECKING environment variable to 1.) occurred after I added that line.

ctheodoris

Owner Jun 30, 2023

Thank you for your interest in Geneformer. The first error appears to be because the metric name you provided differs from the one RayTune expects, so I would suggest determining the name it should look for and providing that accordingly. The second error can be solved by following the suggestion Hugging Face provides in the error message.
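To illustrate how such a name mismatch can arise, here is a minimal sketch. It assumes, as I understand the Hugging Face Trainer to behave during evaluation, that keys returned by compute_metrics are reported with an `eval_` prefix. The helpers `prefix_metrics` and `missing_metrics` are hypothetical names written for this example; they are not part of transformers or Ray.

```python
def prefix_metrics(metrics, prefix="eval"):
    """Hypothetical helper: add '<prefix>_' to every key that lacks it."""
    return {
        key if key.startswith(f"{prefix}_") else f"{prefix}_{key}": value
        for key, value in metrics.items()
    }


def missing_metrics(result, expected):
    """Return the expected metric names that are absent from a result dict."""
    return [name for name in expected if name not in result]


# A compute_metrics-style dict using the unprefixed name.
reported = prefix_metrics({"accuracy": 0.91})
print(reported)
print(missing_metrics(reported, ["eval_accuracy"]))

# A subset of the keys from the result in the error message above:
# no accuracy entry is present under any name.
result_from_log = {"objective": None, "eval_runtime": 0.001, "epoch": 0.1}
print(missing_metrics(result_from_log, ["eval_accuracy"]))
```

Printing the keys of one evaluation result in this way shows which metric name to pass to the search algorithm and the reporter, and whether the metric was computed at all.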

ctheodoris changed discussion status to closed Jun 30, 2023

pchiang5

Jul 1, 2023

Thank you for your answers. However, I followed the notebook "hyperparam_optimiz_for_disease_classifier.py" and used your DCM dataset as the test. Moreover, I noted the use of `'accuracy': acc` rather than `'eval_accuracy': acc` in the following section:

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    # calculate accuracy using sklearn's function
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }

So I changed it to eval_accuracy and set "save_steps": 7248, but the identical error message appeared. Do I have to specify 'eval_accuracy' elsewhere as the input for SearchGenerator?

pchiang5 changed discussion status to open Jul 1, 2023

ctheodoris

Owner Jul 1, 2023

Thank you for your question. The example we provided is exactly how we performed the analysis. As you note, the name in the definition of compute_metrics differs from the name passed to RayTune. We had to arrange it that way because the name was altered from what we had labeled it. If the name is different in your versions, you should do the same: determine what the variable is named so that you can indicate it correctly to RayTune. There are a few places within the code where this is referenced, so I would suggest you modify it in all of those locations. For the "save_steps", it would be more robust to make it a multiple of "logging_steps".
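A minimal sketch of the "save_steps" suggestion, assuming the evaluation interval equals "logging_steps" as in the training arguments posted above. The helper `align_save_steps` is a hypothetical name for this example, and the numbers are only the two step counts that appear in the error message.

```python
import math


def align_save_steps(desired_save_steps, eval_steps):
    """Round desired_save_steps up to the nearest positive multiple of eval_steps."""
    if eval_steps <= 0:
        raise ValueError("eval_steps must be a positive integer")
    multiples = max(1, math.ceil(desired_save_steps / eval_steps))
    return multiples * eval_steps


# Illustrative values: 7248 is 16 * 453, so the two step counts line up.
eval_steps = 453
save_steps = align_save_steps(7000, eval_steps)
print(save_steps, save_steps % eval_steps)
```

Deriving "save_steps" from the evaluation interval, rather than hard-coding it, keeps the two values consistent when the dataset size or batch size changes.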

ctheodoris changed discussion status to closed Jul 1, 2023

pchiang5

Jul 2, 2023

Thank you. I got it to work by using the organ_trainset and organ_evalset (unmodified) from your cell_classification.ipynb as the input for the hyperparameter optimization. I guess something was messed up during the label assignment in my original trials.
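For anyone else who suspects a label-assignment problem, a rough sanity check before launching the search might look like the sketch below. It works on plain lists of dicts rather than a real dataset object, and both the helper name `find_dataset_problems` and the column name "label" are assumptions made for illustration.

```python
def find_dataset_problems(examples, label_key="label", num_classes=None):
    """Hypothetical check: list obvious problems in an evaluation set."""
    problems = []
    if len(examples) == 0:
        # An empty evaluation set cannot produce an accuracy value.
        problems.append("dataset is empty")
        return problems

    missing = sum(1 for example in examples if label_key not in example)
    if missing:
        problems.append(f"{missing} example(s) lack '{label_key}'")

    if num_classes is not None:
        # Collect labels that are not integer class ids in [0, num_classes).
        bad = {
            example[label_key]
            for example in examples
            if label_key in example
            and not (
                isinstance(example[label_key], int)
                and 0 <= example[label_key] < num_classes
            )
        }
        if bad:
            problems.append(
                f"labels outside 0..{num_classes - 1}: {sorted(bad, key=repr)}"
            )
    return problems


good = [{"input_ids": [1, 2], "label": 0}, {"input_ids": [3], "label": 1}]
broken = [{"input_ids": [1, 2], "label": "dcm"}, {"input_ids": [3]}]
print(find_dataset_problems(good, num_classes=2))
print(find_dataset_problems(broken, num_classes=2))
print(find_dataset_problems([]))
```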


error: `eval_accuracy` not included
