[2024-07-13 23:16:59,457] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:00,897] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-13 23:17:00,947] [INFO] [runner.py:571:main] cmd = /ML-A100/team/mm/zhangge/anaconda3/envs/improve/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ../../src/train_bash.py --deepspeed ../deepspeed/ds_z3_config.json --stage sft --do_train --model_name_or_path /ML-A100/team/mm/eamon/self_instruction/models/Qwen1_5_32B --dataset qwen_32B_d2_iter3_model --dataset_dir ../../data --template qwen_like --finetuning_type lora --lora_target all --lora_rank 8 --lora_alpha 16 --lora_dropout 0.05 --output_dir /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_models/qwen_32B_d2_iter3_model --overwrite_cache --overwrite_output_dir --cutoff_len 1024 --preprocessing_num_workers 8 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --warmup_steps 20 --save_steps 100 --eval_steps 100 --evaluation_strategy steps --load_best_model_at_end --learning_rate 5e-5 --num_train_epochs 2.0 --max_samples 3000 --val_size 0.1 --ddp_timeout 180000000 --plot_loss --bf16
[2024-07-13 23:17:02,969] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_PCI_RELAXED_ORDERING=1
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=INFO
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth1
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_GID_INDEX=7
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_RETRY_CNT=7
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3
[2024-07-13 23:17:03,894] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=23
[2024-07-13 23:17:03,894] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-07-13 23:17:03,894] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-07-13 23:17:03,894] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-07-13 23:17:03,894] [INFO] [launch.py:163:main] dist_world_size=8
[2024-07-13 23:17:03,894] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-07-13 23:17:09,638] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:09,718] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:09,747] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:09,772] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:09,808] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:09,816] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:10,023] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:10,170] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-13 23:17:11,655] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:11,756] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:11,841] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:11,841] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-13 23:17:11,842] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:11,845] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:11,853] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:12,039] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-13 23:17:12,194] [INFO] [comm.py:637:init_distributed] cdb=None
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:12 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:12 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:12 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:12 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:12 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/13/2024 23:17:12 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/13/2024 23:17:13 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:13 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/13/2024 23:17:13 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:13 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:13 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
t-20240713214052-lxb45-worker-0:200882:200882 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200882:200882 [0] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200882:200882 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.1+cuda12.1
t-20240713214052-lxb45-worker-0:200885:200885 [3] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200889:200889 [7] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200885:200885 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200889:200889 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200886:200886 [4] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200886:200886 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200884:200884 [2] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200884:200884 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200885:200885 [3] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200889:200889 [7] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200886:200886 [4] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200884:200884 [2] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200887:200887 [5] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200887:200887 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200887:200887 [5] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200883:200883 [1] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200888:200888 [6] NCCL INFO cudaDriverVersion 12030
t-20240713214052-lxb45-worker-0:200883:200883 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200888:200888 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200883:200883 [1] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200888:200888 [6] NCCL INFO Bootstrap : Using eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO P2P plugin IBext
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.60<0>
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO NVLS multicast support is not available on dev 6
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO NVLS multicast support is not available on dev 7
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO NVLS multicast support is not available on dev 2
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO NVLS multicast support is not available on dev 5
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO NVLS multicast support is not available on dev 4
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO NVLS multicast support is not available on dev 0
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO NVLS multicast support is not available on dev 1
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO NVLS multicast support is not available on dev 3
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 00/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 01/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 00/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 00/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 02/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 01/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 03/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 01/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 04/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 02/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 02/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 05/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 03/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 03/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 06/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 04/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 04/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 07/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 05/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 05/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 08/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 09/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 10/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 06/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 06/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 07/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 07/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 11/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 08/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 12/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 08/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 09/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 09/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 13/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 14/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 10/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 15/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 10/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 11/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 11/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 12/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 13/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 12/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 13/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 14/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 15/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 14/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Channel 15/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 00/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 01/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 02/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 03/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 04/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 05/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 06/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 07/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 08/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 09/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 10/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 11/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 00/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 01/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 12/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 13/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 14/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Channel 15/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 02/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 03/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 04/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 05/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 06/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 07/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 08/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 09/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 10/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 11/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 12/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 13/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 14/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Channel 15/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200887:202151 [5] NCCL INFO comm 0xc50e6d0 rank 5 nranks 8 cudaDev 5 busId ca000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200888:202153 [6] NCCL INFO comm 0xc99b2b0 rank 6 nranks 8 cudaDev 6 busId e0000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200885:202150 [3] NCCL INFO comm 0xdd50230 rank 3 nranks 8 cudaDev 3 busId 2d000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200889:202147 [7] NCCL INFO comm 0xd5821d0 rank 7 nranks 8 cudaDev 7 busId e4000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200884:202149 [2] NCCL INFO comm 0xdb46f90 rank 2 nranks 8 cudaDev 2 busId 29000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200883:202152 [1] NCCL INFO comm 0xd755740 rank 1 nranks 8 cudaDev 1 busId 13000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200886:202148 [4] NCCL INFO comm 0xbfc47d0 rank 4 nranks 8 cudaDev 4 busId c5000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200882:202138 [0] NCCL INFO comm 0xdd152a0 rank 0 nranks 8 cudaDev 0 busId d000 commId 0xb2e30db6e1080cb7 - Init COMPLETE
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter2_d2_10k_output_filtered_evaluated_filtered.json...
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/13/2024 23:17:38 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 7985, 264, 3364, 911, 264, 12305, 429, 4278, 369, 279, 4126, 315, 15235, 3412, 1874, 13, 151645, 198, 151644, 77091, 198, 12522, 5193, 264, 882, 11, 304, 264, 89156, 3412, 12481, 518, 279, 4126, 315, 15235, 3412, 11, 1052, 572, 264, 12305, 6941, 4892, 1694, 13, 4892, 1694, 572, 264, 1584, 8668, 10603, 37821, 20443, 11229, 6188, 311, 7789, 279, 13923, 323, 11811, 304, 862, 2222, 311, 11912, 279, 2070, 315, 20443, 11229, 382, 49, 10317, 7391, 806, 2849, 3238, 98129, 16263, 279, 2083, 11, 10476, 1105, 311, 2884, 323, 2225, 501, 25185, 11, 30374, 21204, 11, 323, 41018, 821, 13, 1260, 572, 264, 15172, 4462, 315, 279, 2083, 11, 2677, 5527, 311, 38480, 264, 10476, 1424, 323, 3010, 806, 18726, 382, 19810, 1660, 264, 5662, 11, 4892, 1694, 1030, 264, 4911, 17294, 323, 572, 1632, 12, 43236, 553, 279, 11811, 13, 1260, 572, 2677, 23541, 311, 3960, 323, 7269, 5561, 11, 323, 1035, 3545, 2548, 4755, 323, 16579, 304, 20333, 448, 279, 13923, 382, 3966, 1899, 11, 264, 41345, 18335, 572, 1865, 304, 279, 2070, 315, 15235, 429, 1030, 279, 4650, 311, 2297, 279, 1879, 15683, 13, 576, 2083, 572, 97292, 323, 6439, 2163, 279, 8866, 311, 45364, 279, 5440, 323, 4727, 279, 501, 15235, 1849, 382, 49, 10317, 572, 518, 279, 4746, 315, 279, 2390, 11, 3238, 98129, 311, 5978, 429, 279, 501, 15235, 1849, 572, 438, 11050, 323, 7373, 438, 3204, 13, 1260, 572, 12409, 311, 387, 264, 949, 315, 1741, 264, 4910, 54307, 2390, 323, 6476, 264, 5538, 5530, 315, 60138, 979, 279, 501, 15235, 1849, 572, 5499, 38197, 311, 279, 1879, 382, 3830, 429, 1899, 389, 11, 4892, 1694, 6116, 458, 25098, 949, 315, 279, 2083, 11, 10476, 311, 3060, 311, 11912, 279, 2070, 315, 20443, 11229, 13, 1260, 572, 25195, 369, 279, 6638, 311, 975, 16263, 1741, 19752, 20062, 323, 6476, 429, 566, 1030, 1730, 806, 830, 8098, 304, 2272, 13, 151643]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Write a story about a robot that works for the center of AI research group.<|im_end|>
<|im_start|>assistant
Once upon a time, in a bustling research facility at the center of AI research, there was a robot named Robby. Robby was a state-of-the-art artificial intelligence designed to assist the scientists and researchers in their quest to advance the field of artificial intelligence.
Robby spent his days working tirelessly alongside the team, helping them to design and develop new algorithms, conducting experiments, and analyzing data. He was a valuable member of the team, always ready to lend a helping hand and offer his expertise.
Despite being a machine, Robby had a unique personality and was well-liked by the researchers. He was always eager to learn and improve himself, and would often ask questions and engage in discussions with the scientists.
One day, a breakthrough discovery was made in the field of AI that had the potential to change the world forever. The team was ecstatic and worked around the clock to refine the technology and perfect the new AI system.
Robby was at the heart of the project, working tirelessly to ensure that the new AI system was as efficient and effective as possible. He was proud to be a part of such a ground-breaking project and felt a deep sense of accomplishment when the new AI system was finally unveiled to the world.
From that day on, Robby became an integral part of the team, helping to continue to advance the field of artificial intelligence. He was grateful for the opportunity to work alongside such brilliant minds and felt that he had found his true calling in life.<|endoftext|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 12522, 5193, 264, 882, 11, 304, 264, 89156, 3412, 12481, 518, 279, 4126, 315, 15235, 3412, 11, 1052, 572, 264, 12305, 6941, 4892, 1694, 13, 4892, 1694, 572, 264, 1584, 8668, 10603, 37821, 20443, 11229, 6188, 311, 7789, 279, 13923, 323, 11811, 304, 862, 2222, 311, 11912, 279, 2070, 315, 20443, 11229, 382, 49, 10317, 7391, 806, 2849, 3238, 98129, 16263, 279, 2083, 11, 10476, 1105, 311, 2884, 323, 2225, 501, 25185, 11, 30374, 21204, 11, 323, 41018, 821, 13, 1260, 572, 264, 15172, 4462, 315, 279, 2083, 11, 2677, 5527, 311, 38480, 264, 10476, 1424, 323, 3010, 806, 18726, 382, 19810, 1660, 264, 5662, 11, 4892, 1694, 1030, 264, 4911, 17294, 323, 572, 1632, 12, 43236, 553, 279, 11811, 13, 1260, 572, 2677, 23541, 311, 3960, 323, 7269, 5561, 11, 323, 1035, 3545, 2548, 4755, 323, 16579, 304, 20333, 448, 279, 13923, 382, 3966, 1899, 11, 264, 41345, 18335, 572, 1865, 304, 279, 2070, 315, 15235, 429, 1030, 279, 4650, 311, 2297, 279, 1879, 15683, 13, 576, 2083, 572, 97292, 323, 6439, 2163, 279, 8866, 311, 45364, 279, 5440, 323, 4727, 279, 501, 15235, 1849, 382, 49, 10317, 572, 518, 279, 4746, 315, 279, 2390, 11, 3238, 98129, 311, 5978, 429, 279, 501, 15235, 1849, 572, 438, 11050, 323, 7373, 438, 3204, 13, 1260, 572, 12409, 311, 387, 264, 949, 315, 1741, 264, 4910, 54307, 2390, 323, 6476, 264, 5538, 5530, 315, 60138, 979, 279, 501, 15235, 1849, 572, 5499, 38197, 311, 279, 1879, 382, 3830, 429, 1899, 389, 11, 4892, 1694, 6116, 458, 25098, 949, 315, 279, 2083, 11, 10476, 311, 3060, 311, 11912, 279, 2070, 315, 20443, 11229, 13, 1260, 572, 25195, 369, 279, 6638, 311, 975, 16263, 1741, 19752, 20062, 323, 6476, 429, 566, 1030, 1730, 806, 830, 8098, 304, 2272, 13, 151643]
labels:
Once upon a time, in a bustling research facility at the center of AI research, there was a robot named Robby. Robby was a state-of-the-art artificial intelligence designed to assist the scientists and researchers in their quest to advance the field of artificial intelligence.
Robby spent his days working tirelessly alongside the team, helping them to design and develop new algorithms, conducting experiments, and analyzing data. He was a valuable member of the team, always ready to lend a helping hand and offer his expertise.
Despite being a machine, Robby had a unique personality and was well-liked by the researchers. He was always eager to learn and improve himself, and would often ask questions and engage in discussions with the scientists.
One day, a breakthrough discovery was made in the field of AI that had the potential to change the world forever. The team was ecstatic and worked around the clock to refine the technology and perfect the new AI system.
Robby was at the heart of the project, working tirelessly to ensure that the new AI system was as efficient and effective as possible. He was proud to be a part of such a ground-breaking project and felt a deep sense of accomplishment when the new AI system was finally unveiled to the world.
From that day on, Robby became an integral part of the team, helping to continue to advance the field of artificial intelligence. He was grateful for the opportunity to work alongside such brilliant minds and felt that he had found his true calling in life.<|endoftext|>
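In the sample dump above, `label_ids` matches `input_ids` except that the leading prompt tokens (system and user turns plus the `<|im_start|>assistant` header) are replaced with -100, so only the response contributes to the loss. The following is a minimal sketch of that masking, not llmtuner's actual code: `mask_prompt` and `IGNORE_INDEX` are hypothetical names, and the prompt length is assumed to be known in advance. The value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_prompt(input_ids, prompt_len):
    """Copy input_ids, hiding the first prompt_len tokens from the loss."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


# Short excerpt of the sample above: the prompt ends with the tokens
# 151644, 77091, 198 and the response begins at token 12522.
sample_ids = [151644, 77091, 198, 12522, 5193, 264, 882]
labels = mask_prompt(sample_ids, prompt_len=3)
print(labels)
```

Applied to the full sample, the same rule would mask every token before 12522 and leave the response, including the final 151643 (`<|endoftext|>`), as the training target.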
[2024-07-13 23:18:51,494] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 771, num_elems = 32.51B
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: v_proj,o_proj,k_proj,down_proj,up_proj,q_proj,gate_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: q_proj,v_proj,down_proj,k_proj,up_proj,gate_proj,o_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: o_proj,k_proj,up_proj,gate_proj,down_proj,v_proj,q_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: down_proj,o_proj,k_proj,up_proj,gate_proj,q_proj,v_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: v_proj,o_proj,gate_proj,q_proj,up_proj,down_proj,k_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: up_proj,down_proj,v_proj,o_proj,gate_proj,q_proj,k_proj
07/13/2024 23:19:22 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:22 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:22 - INFO - llmtuner.model.utils - Found linear modules: v_proj,down_proj,k_proj,q_proj,o_proj,gate_proj,up_proj
07/13/2024 23:19:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/13/2024 23:19:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/13/2024 23:19:23 - INFO - llmtuner.model.utils - Found linear modules: up_proj,gate_proj,o_proj,k_proj,q_proj,v_proj,down_proj
07/13/2024 23:19:46 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/13/2024 23:20:16 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
[2024-07-13 23:20:16,973] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.0, git-hash=unknown, git-branch=unknown
[2024-07-13 23:20:17,046] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-13 23:20:17,053] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-13 23:20:17,053] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-13 23:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-07-13 23:20:17,186] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-07-13 23:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-07-13 23:20:17,186] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-07-13 23:20:17,419] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-07-13 23:20:17,420] [INFO] [utils.py:792:see_memory_usage] MA 8.75 GB Max_MA 11.53 GB CA 10.41 GB Max_CA 20 GB
[2024-07-13 23:20:17,420] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.37 GB, percent = 0.5%
[2024-07-13 23:20:17,435] [INFO] [stage3.py:128:__init__] Reduce bucket size 26214400
[2024-07-13 23:20:17,435] [INFO] [stage3.py:129:__init__] Prefetch bucket size 23592960
[2024-07-13 23:20:17,646] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-07-13 23:20:17,646] [INFO] [utils.py:792:see_memory_usage] MA 8.75 GB Max_MA 8.75 GB CA 10.41 GB Max_CA 10 GB
[2024-07-13 23:20:17,647] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.41 GB, percent = 0.5%
Parameter Offload: Total persistent parameters: 25760768 in 1025 params
[2024-07-13 23:20:18,419] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-07-13 23:20:18,420] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.75 GB CA 10.41 GB Max_CA 10 GB
[2024-07-13 23:20:18,421] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.44 GB, percent = 0.5%
[2024-07-13 23:20:18,615] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-07-13 23:20:18,615] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.41 GB Max_CA 10 GB
[2024-07-13 23:20:18,616] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.44 GB, percent = 0.5%
[2024-07-13 23:20:19,391] [INFO] [utils.py:791:see_memory_usage] After creating fp16 partitions: 1
[2024-07-13 23:20:19,392] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:19,392] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.57 GB, percent = 0.5%
[2024-07-13 23:20:19,587] [INFO] [utils.py:791:see_memory_usage] Before creating fp32 partitions
[2024-07-13 23:20:19,587] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:19,588] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.57 GB, percent = 0.5%
[2024-07-13 23:20:19,786] [INFO] [utils.py:791:see_memory_usage] After creating fp32 partitions
[2024-07-13 23:20:19,786] [INFO] [utils.py:792:see_memory_usage] MA 8.68 GB Max_MA 8.69 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:19,787] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.57 GB, percent = 0.5%
[2024-07-13 23:20:19,987] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-07-13 23:20:19,988] [INFO] [utils.py:792:see_memory_usage] MA 8.68 GB Max_MA 8.68 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:19,989] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.59 GB, percent = 0.5%
[2024-07-13 23:20:20,231] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-07-13 23:20:20,231] [INFO] [utils.py:792:see_memory_usage] MA 8.74 GB Max_MA 8.8 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:20,232] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.59 GB, percent = 0.5%
[2024-07-13 23:20:20,232] [INFO] [stage3.py:482:_setup_for_real_optimizer] optimizer state initialized
[2024-07-13 23:20:20,800] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-07-13 23:20:20,800] [INFO] [utils.py:792:see_memory_usage] MA 8.8 GB Max_MA 8.8 GB CA 10.16 GB Max_CA 10 GB
[2024-07-13 23:20:20,801] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.72 GB, percent = 0.5%
[2024-07-13 23:20:20,801] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-07-13 23:20:20,801] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-13 23:20:20,802] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-07-13 23:20:20,802] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2024-07-13 23:20:20,811] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] activation_checkpointing_config {
"partition_activations": false,
"contiguous_memory_optimization": false,
"cpu_checkpointing": false,
"number_checkpoints": null,
"synchronize_checkpoint_boundary": false,
"profile": false
}
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] amp_enabled .................. False
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] amp_params ................... False
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] autotuning_config ............ {
"enabled": false,
"start_step": null,
"end_step": null,
"metric_path": null,
"arg_mappings": null,
"metric": "throughput",
"model_info": null,
"results_dir": "autotuning_results",
"exps_dir": "autotuning_exps",
"overwrite": true,
"fast": true,
"start_profile_step": 3,
"end_profile_step": 5,
"tuner_type": "gridsearch",
"tuner_early_stopping": 5,
"tuner_num_trials": 50,
"model_info_path": null,
"mp_size": 1,
"max_train_batch_size": null,
"min_train_batch_size": 1,
"max_train_micro_batch_size_per_gpu": 1.024000e+03,
"min_train_micro_batch_size_per_gpu": 1,
"num_tuning_micro_batch_sizes": 3
}
[2024-07-13 23:20:20,812] [INFO] [config.py:988:print] bfloat16_enabled ............. True
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7f22eef23f10>
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] communication_data_type ...... None
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] dataloader_drop_last ......... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] disable_allgather ............ False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] dump_state ................... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] elasticity_enabled ........... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] flops_profiler_config ........ {
"enabled": false,
"recompute_fwd_factor": 0.0,
"profile_step": 1,
"module_depth": -1,
"top_modules": 1,
"detailed": true,
"output_file": null
}
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] fp16_auto_cast ............... None
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] fp16_enabled ................. False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] global_rank .................. 0
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] grad_accum_dtype ............. None
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] gradient_accumulation_steps .. 2
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] graph_harvesting ............. False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] load_universal_checkpoint .... False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] loss_scale ................... 1.0
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] memory_breakdown ............. False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] mics_shard_size .............. -1
[2024-07-13 23:20:20,813] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] nebula_config ................ {
"enabled": false,
"persistent_storage_path": null,
"persistent_time_interval": 100,
"num_of_version_in_retention": 2,
"enable_nebula_load": true,
"load_path": null
}
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] optimizer_name ............... None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] optimizer_params ............. None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] pld_enabled .................. False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] pld_params ................... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] prescale_gradients ........... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] scheduler_name ............... None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] scheduler_params ............. None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] sparse_attention ............. None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] steps_per_print .............. inf
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] train_batch_size ............. 16
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 1
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] use_node_local_storage ....... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] weight_quantization_config ... None
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] world_size ................... 8
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] zero_allow_untested_optimizer True
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=26214400 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=23592960 param_persistence_threshold=51200 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] zero_enabled ................. True
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
[2024-07-13 23:20:20,814] [INFO] [config.py:988:print] zero_optimization_stage ...... 3
[2024-07-13 23:20:20,814] [INFO] [config.py:974:print_user_config] json = {
"train_batch_size": 16,
"train_micro_batch_size_per_gpu": 1,
"gradient_accumulation_steps": 2,
"gradient_clipping": 1.0,
"zero_allow_untested_optimizer": true,
"fp16": {
"enabled": false,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": true
},
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1.000000e+09,
"reduce_bucket_size": 2.621440e+07,
"stage3_prefetch_bucket_size": 2.359296e+07,
"stage3_param_persistence_threshold": 5.120000e+04,
"stage3_max_live_parameters": 1.000000e+09,
"stage3_max_reuse_distance": 1.000000e+09,
"stage3_gather_16bit_weights_on_model_save": true
},
"steps_per_print": inf
}
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Using network IBext
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO NVLS multicast support is not available on dev 4
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO NVLS multicast support is not available on dev 3
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO NVLS multicast support is not available on dev 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO NVLS multicast support is not available on dev 0
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO NVLS multicast support is not available on dev 6
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO NVLS multicast support is not available on dev 5
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO NVLS multicast support is not available on dev 1
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO NVLS multicast support is not available on dev 2
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 7
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO P2P Chunksize set to 524288
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 00/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 01/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 00/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 02/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 00/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 01/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 02/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 03/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 01/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 04/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 02/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 03/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 04/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 05/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 03/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 05/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 06/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 04/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 06/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 07/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 08/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 09/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 05/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 10/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 07/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 08/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 11/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 12/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 06/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 13/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 07/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 09/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 08/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 10/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 14/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Channel 15/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 09/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 11/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 10/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 12/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 13/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 11/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 12/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 14/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 13/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 15/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 14/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 15/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Connected all rings
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 00/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 01/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 00/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 02/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 03/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 04/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 05/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 06/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 07/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 08/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 09/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 10/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 11/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 01/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 02/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 03/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 12/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 13/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 14/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Channel 15/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 04/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 05/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 06/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 07/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 08/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 09/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 10/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 11/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 12/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 13/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 14/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Channel 15/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO Connected all trees
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240713214052-lxb45-worker-0:200886:204000 [4] NCCL INFO comm 0x7f974cf88400 rank 4 nranks 8 cudaDev 4 busId c5000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200888:204004 [6] NCCL INFO comm 0x7ff7acf25390 rank 6 nranks 8 cudaDev 6 busId e0000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200882:203999 [0] NCCL INFO comm 0x7f1e76970a70 rank 0 nranks 8 cudaDev 0 busId d000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200885:204001 [3] NCCL INFO comm 0x7f05acf3efe0 rank 3 nranks 8 cudaDev 3 busId 2d000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200883:204002 [1] NCCL INFO comm 0x7f21b8f3e290 rank 1 nranks 8 cudaDev 1 busId 13000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200889:204003 [7] NCCL INFO comm 0x7ff810f765e0 rank 7 nranks 8 cudaDev 7 busId e4000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200884:204006 [2] NCCL INFO comm 0x7f2b9cf8f440 rank 2 nranks 8 cudaDev 2 busId 29000 commId 0x329385618a5534b5 - Init COMPLETE
t-20240713214052-lxb45-worker-0:200887:204005 [5] NCCL INFO comm 0x7f1a88f3e320 rank 5 nranks 8 cudaDev 5 busId ca000 commId 0x329385618a5534b5 - Init COMPLETE
{'loss': 1.5769, 'grad_norm': 2.4350972054221836, 'learning_rate': 2.5e-05, 'epoch': 0.24}
{'loss': 1.0321, 'grad_norm': 1.9409665743940592, 'learning_rate': 5e-05, 'epoch': 0.49}
{'loss': 0.5324, 'grad_norm': 0.6373853834505803, 'learning_rate': 4.685866540361456e-05, 'epoch': 0.73}
{'loss': 0.4983, 'grad_norm': 0.4835656030978697, 'learning_rate': 3.822410025817406e-05, 'epoch': 0.98}
{'loss': 0.4741, 'grad_norm': 1.0227452253571707, 'learning_rate': 2.6266229220967818e-05, 'epoch': 1.22}
{'loss': 0.4467, 'grad_norm': 0.3163433503926468, 'learning_rate': 1.399014621105914e-05, 'epoch': 1.46}
{'loss': 0.4471, 'grad_norm': 0.5876366568682337, 'learning_rate': 4.480913969818098e-06, 'epoch': 1.71}
{'loss': 0.4108, 'grad_norm': 0.6629601353362649, 'learning_rate': 1.2826691520262114e-07, 'epoch': 1.95}
{'train_runtime': 441.8557, 'train_samples_per_second': 2.96, 'train_steps_per_second': 0.186, 'train_loss': 0.6762613843126994, 'epoch': 2.0}
***** train metrics *****
epoch = 2.0
total_flos = 29104GF
train_loss = 0.6763
train_runtime = 0:07:21.85
train_samples_per_second = 2.96
train_steps_per_second = 0.186
Figure saved at: /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_models/qwen_32B_d2_iter3_model/training_loss.png
07/13/2024 23:28:19 - WARNING - llmtuner.extras.ploting - No metric eval_loss to plot.
***** eval metrics *****
epoch = 2.0
eval_loss = 0.601
eval_runtime = 0:00:10.16
eval_samples_per_second = 7.184
eval_steps_per_second = 0.984
[2024-07-13 23:28:31,630] [INFO] [launch.py:347:main] Process 200887 exits successfully.
[2024-07-13 23:28:31,630] [INFO] [launch.py:347:main] Process 200884 exits successfully.
[2024-07-13 23:28:32,631] [INFO] [launch.py:347:main] Process 200885 exits successfully.
[2024-07-13 23:28:32,631] [INFO] [launch.py:347:main] Process 200883 exits successfully.
[2024-07-13 23:28:32,631] [INFO] [launch.py:347:main] Process 200886 exits successfully.
[2024-07-13 23:28:32,631] [INFO] [launch.py:347:main] Process 200888 exits successfully.
[2024-07-13 23:28:32,631] [INFO] [launch.py:347:main] Process 200889 exits successfully.
[2024-07-13 23:28:32,632] [INFO] [launch.py:347:main] Process 200882 exits successfully.