runtime error

Exit code: 1. Reason: iprocessing/api.py:869] failed (exitcode: 1) local_rank: 1 (pid: 77) of binary: /usr/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 1159, in launch_command
    multi_gpu_launcher(args)
  File "/usr/local/lib/python3.11/dist-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
finetune_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-12-23_03:39:43
  host      : r-daresearch-train-70b-4bit-t7rg43g4-dcf5a-1wnlf
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 79)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-23_03:39:43
  host      : r-daresearch-train-70b-4bit-t7rg43g4-dcf5a-1wnlf
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 77)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
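
The summary above reports `error_file: <N/A>` and no worker traceback, so the actual exception raised on rank 1 is not visible here. The linked PyTorch page describes the `record` decorator from `torch.distributed.elastic.multiprocessing.errors`, which lets the launcher capture the worker's exception. A minimal sketch of how the entry point could be wrapped follows. The contents of finetune_script.py are not shown in the log, so the `main` function and its body are hypothetical placeholders, and the import fallback is only an assumption made so the sketch runs where PyTorch is not installed.

```python
# Sketch only: the real contents of finetune_script.py are not shown in the log.
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Assumed fallback so this sketch still runs without PyTorch installed.
    def record(fn):
        return fn


@record  # on an uncaught exception, records it to the error file set by the launcher
def main() -> int:
    # Hypothetical placeholder for the actual fine-tuning logic.
    return 0


if __name__ == "__main__":
    main()
```

With the entry point decorated this way, a failing worker should produce a populated `error_file` and a traceback in the failure summary instead of `<N/A>`.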
