Hello,

I have 2 x A100 GPUs (160 GB VRAM in total). I tried to fine-tune the OpenGVLab/InternVL-Chat-V1-5 model by following the steps in the InternVL documentation. However, when I ran the LoRA fine-tuning script from the shell, it crashed with a segmentation fault:

(ft-env) root@07c329c33692:/workspace/ft_InternVL/InternVL/internvl_chat# GPUS=2 PER_DEVICE_BATCH_SIZE=2 sh shell/internvl1.5/2nd_finetune/internvl_chat_v1_5_internlm2_20b_dynamic_res_2nd_finetune_lora.sh
Traceback (most recent call last):
  File "/workspace/ft_InternVL/ft-env/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/ft_InternVL/ft-env/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

Failures:

Root Cause (first observed failure):
[0]:
time : 2025-03-18_10:26:05
host : 5adc298fec1f
rank : 0 (local_rank: 0)
exitcode : -11 (pid: 272919)
error_file: <N/A>
traceback : Signal 11 (SIGSEGV) received by PID 272919
