[2024-07-16 10:20:52,623] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:20:54,599] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-07-16 10:20:54,670] [INFO] [runner.py:571:main] cmd = /ML-A100/team/mm/zhangge/anaconda3/envs/improve/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ../../src/train_bash.py --deepspeed ../deepspeed/ds_z3_config.json --stage sft --do_train --model_name_or_path /ML-A100/team/mm/eamon/self_instruction/models/Qwen1_5_32B --dataset qwen_32B_d4_iter5_model --dataset_dir ../../data --template qwen_like --finetuning_type lora --lora_target all --lora_rank 8 --lora_alpha 16 --lora_dropout 0.05 --output_dir /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_models/qwen_32B_d4_iter5_model --overwrite_cache --overwrite_output_dir --cutoff_len 1024 --preprocessing_num_workers 8 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 2 --lr_scheduler_type cosine --logging_steps 10 --warmup_steps 20 --save_steps 100 --eval_steps 100 --evaluation_strategy steps --load_best_model_at_end --learning_rate 5e-5 --num_train_epochs 2.0 --max_samples 3000 --val_size 0.1 --ddp_timeout 180000000 --plot_loss --bf16
[2024-07-16 10:20:56,675] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_PCI_RELAXED_ORDERING=1
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=INFO
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_SOCKET_IFNAME=eth1
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_HCA=mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_GID_INDEX=7
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_RETRY_CNT=7
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3
[2024-07-16 10:20:57,566] [INFO] [launch.py:138:main] 0 NCCL_IB_TIMEOUT=23
[2024-07-16 10:20:57,566] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2024-07-16 10:20:57,566] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=8, node_rank=0
[2024-07-16 10:20:57,566] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2024-07-16 10:20:57,566] [INFO] [launch.py:163:main] dist_world_size=8
[2024-07-16 10:20:57,566] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2024-07-16 10:21:04,270] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,280] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,280] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,282] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,282] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,282] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,282] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:04,350] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-07-16 10:21:07,019] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,019] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,019] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-07-16 10:21:07,023] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,037] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,041] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,053] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,055] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-07-16 10:21:07,076] [INFO] [comm.py:637:init_distributed] cdb=None
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 4, device: cuda:4, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 5, device: cuda:5, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - INFO - llmtuner.hparams.parser - Process rank: 7, device: cuda:7, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:07 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:08 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:08 - INFO - llmtuner.hparams.parser - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:08 - WARNING - llmtuner.hparams.parser - `ddp_find_unused_parameters` needs to be set as False for LoRA in DDP training.
07/16/2024 10:21:08 - INFO - llmtuner.hparams.parser - Process rank: 6, device: cuda:6, n_gpu: 1, distributed training: True, compute dtype: torch.bfloat16
07/16/2024 10:21:08 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
07/16/2024 10:21:08 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:08 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:08 - INFO - llmtuner.data.template - Add <|im_end|>,<|endoftext|> to stop words.
t-20240716144653-fl9g6-worker-0:82641:82641 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82641:82641 [0] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82641:82641 [0] NCCL INFO cudaDriverVersion 12030
NCCL version 2.18.1+cuda12.1
t-20240716144653-fl9g6-worker-0:82644:82644 [3] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82645:82645 [4] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82644:82644 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82643:82643 [2] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82645:82645 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82643:82643 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82647:82647 [6] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82647:82647 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82648:82648 [7] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82648:82648 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82644:82644 [3] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82645:82645 [4] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82643:82643 [2] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82648:82648 [7] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82647:82647 [6] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82646:82646 [5] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82642:82642 [1] NCCL INFO cudaDriverVersion 12030
t-20240716144653-fl9g6-worker-0:82646:82646 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82642:82642 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82646:82646 [5] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82642:82642 [1] NCCL INFO Bootstrap : Using eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO P2P plugin IBext
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth1
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO NCCL_IB_PCI_RELAXED_ORDERING set by environment to 1.
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [RO]; OOB eth1:172.25.8.127<0>
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO NVLS multicast support is not available on dev 0
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO NVLS multicast support is not available on dev 7
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO NVLS multicast support is not available on dev 3
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO NVLS multicast support is not available on dev 4
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO NVLS multicast support is not available on dev 5
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO NVLS multicast support is not available on dev 6
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO NVLS multicast support is not available on dev 2
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO NVLS multicast support is not available on dev 1
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 00/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 01/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 02/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 03/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 04/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 05/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 06/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 00/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 00/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 07/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 01/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 01/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 02/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 08/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 02/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 03/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 09/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 03/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 04/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 04/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 10/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 05/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 11/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 05/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 06/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 12/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 06/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 07/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 13/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 07/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 14/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 15/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 08/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 08/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 09/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 09/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 10/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 11/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 10/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 12/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 11/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 13/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 12/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 13/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 14/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 14/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 15/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Channel 15/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 00/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 00/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 01/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 02/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 03/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 04/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 05/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 06/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 01/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 02/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 03/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 07/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 08/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 09/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 10/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 11/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 12/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 04/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 05/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 06/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 07/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 08/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 09/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 13/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 14/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 10/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 11/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 12/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Channel 15/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 13/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 14/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Channel 15/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82645:83840 [4] NCCL INFO comm 0xc54c8b0 rank 4 nranks 8 cudaDev 4 busId c5000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82647:83842 [6] NCCL INFO comm 0xdd70680 rank 6 nranks 8 cudaDev 6 busId e0000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82646:83845 [5] NCCL INFO comm 0xc2dbf10 rank 5 nranks 8 cudaDev 5 busId ca000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82642:83844 [1] NCCL INFO comm 0xd856c40 rank 1 nranks 8 cudaDev 1 busId 13000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82641:83831 [0] NCCL INFO comm 0xc890c20 rank 0 nranks 8 cudaDev 0 busId d000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82643:83841 [2] NCCL INFO comm 0xc823910 rank 2 nranks 8 cudaDev 2 busId 29000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82644:83839 [3] NCCL INFO comm 0xda70480 rank 3 nranks 8 cudaDev 3 busId 2d000 commId 0xb03e6c7621795781 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82648:83843 [7] NCCL INFO comm 0xccc10f0 rank 7 nranks 8 cudaDev 7 busId e4000 commId 0xb03e6c7621795781 - Init COMPLETE
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - INFO - llmtuner.data.loader - Loading dataset /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_outputs/qwen_32B_iter4_d4_10k_output_filtered_evaluated_filtered.json...
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
07/16/2024 10:21:12 - WARNING - llmtuner.data.utils - Checksum failed: missing SHA-1 hash value in dataset_info.json.
input_ids:
[151644, 8948, 198, 2610, 525, 264, 10950, 17847, 13, 151645, 198, 151644, 872, 198, 840, 20772, 3170, 2494, 6932, 304, 279, 3669, 624, 2, 37866, 18712, 2142, 151645, 198, 151644, 77091, 198, 3862, 572, 458, 5263, 304, 4961, 5048, 315, 7147, 64810, 304, 279, 3669, 419, 1042, 13, 1096, 3363, 429, 803, 1251, 10321, 476, 31026, 21240, 476, 12213, 8806, 2348, 13350, 1251, 13, 2619, 525, 1657, 7966, 3170, 419, 2578, 387, 12482, 11, 1741, 438, 264, 10000, 304, 4948, 323, 3590, 38410, 11, 279, 8865, 315, 74059, 323, 53068, 389, 3590, 3687, 11, 323, 264, 6853, 315, 6731, 911, 13350, 7674, 323, 3840, 13, 1084, 594, 2989, 369, 1251, 311, 3960, 911, 323, 5091, 1817, 1008, 594, 11799, 311, 1855, 264, 803, 17774, 1223, 323, 28308, 8232, 13, 151643]
inputs:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain why something happened in the news.
#antisemitism<|im_end|>
<|im_start|>assistant
There was an increase in reported cases of anti-Semitism in the news this year. This means that more people experienced or witnessed discrimination or hate speech against Jewish people. There are many reasons why this might be happening, such as a rise in political and social tensions, the spread of misinformation and stereotypes on social media, and a lack of education about Jewish culture and history. It's important for people to learn about and respect each other's differences to create a more harmonious and inclusive society.<|endoftext|>
label_ids:
[-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, 3862, 572, 458, 5263, 304, 4961, 5048, 315, 7147, 64810, 304, 279, 3669, 419, 1042, 13, 1096, 3363, 429, 803, 1251, 10321, 476, 31026, 21240, 476, 12213, 8806, 2348, 13350, 1251, 13, 2619, 525, 1657, 7966, 3170, 419, 2578, 387, 12482, 11, 1741, 438, 264, 10000, 304, 4948, 323, 3590, 38410, 11, 279, 8865, 315, 74059, 323, 53068, 389, 3590, 3687, 11, 323, 264, 6853, 315, 6731, 911, 13350, 7674, 323, 3840, 13, 1084, 594, 2989, 369, 1251, 311, 3960, 911, 323, 5091, 1817, 1008, 594, 11799, 311, 1855, 264, 803, 17774, 1223, 323, 28308, 8232, 13, 151643]
labels:
There was an increase in reported cases of anti-Semitism in the news this year. This means that more people experienced or witnessed discrimination or hate speech against Jewish people. There are many reasons why this might be happening, such as a rise in political and social tensions, the spread of misinformation and stereotypes on social media, and a lack of education about Jewish culture and history. It's important for people to learn about and respect each other's differences to create a more harmonious and inclusive society.<|endoftext|>
[2024-07-16 10:21:55,287] [INFO] [partition_parameters.py:349:__exit__] finished initializing model - num_params = 771, num_elems = 32.51B
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: down_proj,k_proj,q_proj,o_proj,v_proj,up_proj,gate_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: gate_proj,o_proj,k_proj,down_proj,up_proj,v_proj,q_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: up_proj,gate_proj,k_proj,v_proj,q_proj,o_proj,down_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: gate_proj,up_proj,o_proj,down_proj,q_proj,k_proj,v_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: o_proj,q_proj,k_proj,gate_proj,up_proj,v_proj,down_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: down_proj,up_proj,k_proj,gate_proj,q_proj,v_proj,o_proj
07/16/2024 10:22:23 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:23 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:23 - INFO - llmtuner.model.utils - Found linear modules: up_proj,down_proj,q_proj,k_proj,v_proj,gate_proj,o_proj
07/16/2024 10:22:25 - INFO - llmtuner.model.patcher - Gradient checkpointing enabled.
07/16/2024 10:22:25 - INFO - llmtuner.model.adapter - Fine-tuning method: LoRA
07/16/2024 10:22:25 - INFO - llmtuner.model.utils - Found linear modules: v_proj,k_proj,gate_proj,q_proj,up_proj,down_proj,o_proj
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
07/16/2024 10:23:18 - INFO - llmtuner.model.loader - trainable params: 66715648 || all params: 32578933760 || trainable%: 0.2048
[2024-07-16 10:23:19,314] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed info: version=0.13.0, git-hash=unknown, git-branch=unknown
[2024-07-16 10:23:19,392] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2024-07-16 10:23:19,399] [INFO] [logging.py:96:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2024-07-16 10:23:19,399] [INFO] [logging.py:96:log_dist] [Rank 0] Removing param_group that has no 'params' in the basic Optimizer
[2024-07-16 10:23:19,530] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = AdamW
[2024-07-16 10:23:19,531] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2024-07-16 10:23:19,531] [INFO] [logging.py:96:log_dist] [Rank 0] Creating fp16 ZeRO stage 3 optimizer, MiCS is enabled False, Hierarchical params gather False
[2024-07-16 10:23:19,531] [INFO] [logging.py:96:log_dist] [Rank 0] Creating torch.bfloat16 ZeRO stage 3 optimizer
[2024-07-16 10:23:19,819] [INFO] [utils.py:791:see_memory_usage] Stage 3 initialize beginning
[2024-07-16 10:23:19,819] [INFO] [utils.py:792:see_memory_usage] MA 8.75 GB Max_MA 11.53 GB CA 10.26 GB Max_CA 21 GB
[2024-07-16 10:23:19,820] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.41 GB, percent = 0.5%
[2024-07-16 10:23:19,836] [INFO] [stage3.py:128:__init__] Reduce bucket size 26214400
[2024-07-16 10:23:19,837] [INFO] [stage3.py:129:__init__] Prefetch bucket size 23592960
[2024-07-16 10:23:20,134] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [begin]
[2024-07-16 10:23:20,135] [INFO] [utils.py:792:see_memory_usage] MA 8.75 GB Max_MA 8.75 GB CA 10.26 GB Max_CA 10 GB
[2024-07-16 10:23:20,136] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.44 GB, percent = 0.5%
Parameter Offload: Total persistent parameters: 25760768 in 1025 params
[2024-07-16 10:23:21,103] [INFO] [utils.py:791:see_memory_usage] DeepSpeedZeRoOffload initialize [end]
[2024-07-16 10:23:21,103] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.75 GB CA 10.26 GB Max_CA 10 GB
[2024-07-16 10:23:21,104] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.47 GB, percent = 0.5%
[2024-07-16 10:23:21,392] [INFO] [utils.py:791:see_memory_usage] Before creating fp16 partitions
[2024-07-16 10:23:21,392] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.26 GB Max_CA 10 GB
[2024-07-16 10:23:21,393] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.47 GB, percent = 0.5%
[2024-07-16 10:23:22,283] [INFO] [utils.py:791:see_memory_usage] After creating fp16 partitions: 1
[2024-07-16 10:23:22,285] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:22,285] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.59 GB, percent = 0.5%
[2024-07-16 10:23:22,574] [INFO] [utils.py:791:see_memory_usage] Before creating fp32 partitions
[2024-07-16 10:23:22,575] [INFO] [utils.py:792:see_memory_usage] MA 8.64 GB Max_MA 8.64 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:22,576] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.59 GB, percent = 0.5%
[2024-07-16 10:23:22,864] [INFO] [utils.py:791:see_memory_usage] After creating fp32 partitions
[2024-07-16 10:23:22,864] [INFO] [utils.py:792:see_memory_usage] MA 8.67 GB Max_MA 8.69 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:22,865] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.59 GB, percent = 0.5%
[2024-07-16 10:23:23,151] [INFO] [utils.py:791:see_memory_usage] Before initializing optimizer states
[2024-07-16 10:23:23,151] [INFO] [utils.py:792:see_memory_usage] MA 8.67 GB Max_MA 8.67 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:23,152] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.63 GB, percent = 0.5%
[2024-07-16 10:23:23,468] [INFO] [utils.py:791:see_memory_usage] After initializing optimizer states
[2024-07-16 10:23:23,469] [INFO] [utils.py:792:see_memory_usage] MA 8.74 GB Max_MA 8.8 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:23,469] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.64 GB, percent = 0.5%
[2024-07-16 10:23:23,470] [INFO] [stage3.py:482:_setup_for_real_optimizer] optimizer state initialized
[2024-07-16 10:23:24,125] [INFO] [utils.py:791:see_memory_usage] After initializing ZeRO optimizer
[2024-07-16 10:23:24,126] [INFO] [utils.py:792:see_memory_usage] MA 8.8 GB Max_MA 8.8 GB CA 10.02 GB Max_CA 10 GB
[2024-07-16 10:23:24,127] [INFO] [utils.py:799:see_memory_usage] CPU Virtual Memory: used = 9.77 GB, percent = 0.5%
[2024-07-16 10:23:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = AdamW
[2024-07-16 10:23:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed using client LR scheduler
[2024-07-16 10:23:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed LR Scheduler = None
[2024-07-16 10:23:24,127] [INFO] [logging.py:96:log_dist] [Rank 0] step=0, skipped=0, lr=[0.0], mom=[(0.9, 0.999)]
[2024-07-16 10:23:24,136] [INFO] [config.py:984:print] DeepSpeedEngine configuration:
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] activation_checkpointing_config {
    "partition_activations": false,
    "contiguous_memory_optimization": false,
    "cpu_checkpointing": false,
    "number_checkpoints": null,
    "synchronize_checkpoint_boundary": false,
    "profile": false
}
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] aio_config ................... {'block_size': 1048576, 'queue_depth': 8, 'thread_count': 1, 'single_submit': False, 'overlap_events': True}
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] amp_enabled .................. False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] amp_params ................... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] autotuning_config ............ {
    "enabled": false,
    "start_step": null,
    "end_step": null,
    "metric_path": null,
    "arg_mappings": null,
    "metric": "throughput",
    "model_info": null,
    "results_dir": "autotuning_results",
    "exps_dir": "autotuning_exps",
    "overwrite": true,
    "fast": true,
    "start_profile_step": 3,
    "end_profile_step": 5,
    "tuner_type": "gridsearch",
    "tuner_early_stopping": 5,
    "tuner_num_trials": 50,
    "model_info_path": null,
    "mp_size": 1,
    "max_train_batch_size": null,
    "min_train_batch_size": 1,
    "max_train_micro_batch_size_per_gpu": 1.024000e+03,
    "min_train_micro_batch_size_per_gpu": 1,
    "num_tuning_micro_batch_sizes": 3
}
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] bfloat16_enabled ............. True
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] checkpoint_parallel_write_pipeline False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] checkpoint_tag_validation_enabled True
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] checkpoint_tag_validation_fail False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] comms_config ................. <deepspeed.comm.config.DeepSpeedCommsConfig object at 0x7fdd08ee7b80>
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] communication_data_type ...... None
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] compression_config ........... {'weight_quantization': {'shared_parameters': {'enabled': False, 'quantizer_kernel': False, 'schedule_offset': 0, 'quantize_groups': 1, 'quantize_verbose': False, 'quantization_type': 'symmetric', 'quantize_weight_in_forward': False, 'rounding': 'nearest', 'fp16_mixed_quantize': False, 'quantize_change_ratio': 0.001}, 'different_groups': {}}, 'activation_quantization': {'shared_parameters': {'enabled': False, 'quantization_type': 'symmetric', 'range_calibration': 'dynamic', 'schedule_offset': 1000}, 'different_groups': {}}, 'sparse_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'row_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'head_pruning': {'shared_parameters': {'enabled': False, 'method': 'topk', 'schedule_offset': 1000}, 'different_groups': {}}, 'channel_pruning': {'shared_parameters': {'enabled': False, 'method': 'l1', 'schedule_offset': 1000}, 'different_groups': {}}, 'layer_reduction': {'enabled': False}}
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] curriculum_enabled_legacy .... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] curriculum_params_legacy ..... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] data_efficiency_config ....... {'enabled': False, 'seed': 1234, 'data_sampling': {'enabled': False, 'num_epochs': 1000, 'num_workers': 0, 'curriculum_learning': {'enabled': False}}, 'data_routing': {'enabled': False, 'random_ltd': {'enabled': False, 'layer_token_lr_schedule': {'enabled': False}}}}
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] data_efficiency_enabled ...... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] dataloader_drop_last ......... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] disable_allgather ............ False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] dump_state ................... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] dynamic_loss_scale_args ...... None
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_enabled ........... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_gas_boundary_resolution 1
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_layer_name ........ bert.encoder.layer
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_layer_num ......... 0
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_max_iter .......... 100
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_stability ......... 1e-06
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_tol ............... 0.01
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] eigenvalue_verbose ........... False
[2024-07-16 10:23:24,137] [INFO] [config.py:988:print] elasticity_enabled ........... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] flops_profiler_config ........ {
    "enabled": false,
    "recompute_fwd_factor": 0.0,
    "profile_step": 1,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": null
}
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] fp16_auto_cast ............... None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] fp16_enabled ................. False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] fp16_master_weights_and_gradients False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] global_rank .................. 0
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] grad_accum_dtype ............. None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] gradient_accumulation_steps .. 2
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] gradient_clipping ............ 1.0
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] gradient_predivide_factor .... 1.0
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] graph_harvesting ............. False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] hybrid_engine ................ enabled=False max_out_tokens=512 inference_tp_size=1 release_inference_cache=False pin_parameters=True tp_gather_partition_size=8
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] initial_dynamic_scale ........ 1
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] load_universal_checkpoint .... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] loss_scale ................... 1.0
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] memory_breakdown ............. False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] mics_hierarchial_params_gather False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] mics_shard_size .............. -1
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] monitor_config ............... tensorboard=TensorBoardConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') wandb=WandbConfig(enabled=False, group=None, team=None, project='deepspeed') csv_monitor=CSVConfig(enabled=False, output_path='', job_name='DeepSpeedJobName') enabled=False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] nebula_config ................ {
    "enabled": false,
    "persistent_storage_path": null,
    "persistent_time_interval": 100,
    "num_of_version_in_retention": 2,
    "enable_nebula_load": true,
    "load_path": null
}
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] optimizer_legacy_fusion ...... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] optimizer_name ............... None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] optimizer_params ............. None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0, 'pipe_partitioned': True, 'grad_partitioned': True}
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] pld_enabled .................. False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] pld_params ................... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] prescale_gradients ........... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] scheduler_name ............... None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] scheduler_params ............. None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] seq_parallel_communication_data_type torch.float32
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] sparse_attention ............. None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] sparse_gradients_enabled ..... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] steps_per_print .............. inf
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] train_batch_size ............. 16
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] train_micro_batch_size_per_gpu 1
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] use_data_before_expert_parallel_ False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] use_node_local_storage ....... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] wall_clock_breakdown ......... False
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] weight_quantization_config ... None
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] world_size ................... 8
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] zero_allow_untested_optimizer True
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] zero_config .................. stage=3 contiguous_gradients=True reduce_scatter=True reduce_bucket_size=26214400 use_multi_rank_bucket_allreduce=True allgather_partitions=True allgather_bucket_size=500,000,000 overlap_comm=True load_from_fp32_weights=True elastic_checkpoint=False offload_param=None offload_optimizer=None sub_group_size=1000000000 cpu_offload_param=None cpu_offload_use_pin_memory=None cpu_offload=None prefetch_bucket_size=23592960 param_persistence_threshold=51200 model_persistence_threshold=sys.maxsize max_live_parameters=1000000000 max_reuse_distance=1000000000 gather_16bit_weights_on_model_save=True stage3_gather_fp16_weights_on_model_save=False ignore_unused_parameters=True legacy_stage1=False round_robin_gradients=False zero_hpz_partition_size=1 zero_quantized_weights=False zero_quantized_nontrainable_weights=False zero_quantized_gradients=False mics_shard_size=-1 mics_hierarchical_params_gather=False memory_efficient_linear=True pipeline_loading_checkpoint=False override_module_apply=True
[2024-07-16 10:23:24,138] [INFO] [config.py:988:print] zero_enabled ................. True
[2024-07-16 10:23:24,139] [INFO] [config.py:988:print] zero_force_ds_cpu_optimizer .. True
[2024-07-16 10:23:24,139] [INFO] [config.py:988:print] zero_optimization_stage ...... 3
[2024-07-16 10:23:24,139] [INFO] [config.py:974:print_user_config] json = {
    "train_batch_size": 16,
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 2,
    "gradient_clipping": 1.0,
    "zero_allow_untested_optimizer": true,
    "fp16": {
        "enabled": false,
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1.000000e+09,
        "reduce_bucket_size": 2.621440e+07,
        "stage3_prefetch_bucket_size": 2.359296e+07,
        "stage3_param_persistence_threshold": 5.120000e+04,
        "stage3_max_live_parameters": 1.000000e+09,
        "stage3_max_reuse_distance": 1.000000e+09,
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "steps_per_print": inf
}
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Using network IBext
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Setting affinity for GPU 3 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO NVLS multicast support is not available on dev 3
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Setting affinity for GPU 5 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO NVLS multicast support is not available on dev 5
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO NVLS multicast support is not available on dev 0
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO NVLS multicast support is not available on dev 1
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Setting affinity for GPU 2 to ffffffff,00000000,ffffffff
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO NVLS multicast support is not available on dev 2
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Setting affinity for GPU 6 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO NVLS multicast support is not available on dev 6
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Setting affinity for GPU 4 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Setting affinity for GPU 7 to ffffffff,00000000,ffffffff,00000000
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO NVLS multicast support is not available on dev 7
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO NVLS multicast support is not available on dev 4
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Trees [0] 4/-1/-1->3->2 [1] 4/-1/-1->3->2 [2] 4/-1/-1->3->2 [3] 4/-1/-1->3->2 [4] 4/-1/-1->3->2 [5] 4/-1/-1->3->2 [6] 4/-1/-1->3->2 [7] 4/-1/-1->3->2 [8] 4/-1/-1->3->2 [9] 4/-1/-1->3->2 [10] 4/-1/-1->3->2 [11] 4/-1/-1->3->2 [12] 4/-1/-1->3->2 [13] 4/-1/-1->3->2 [14] 4/-1/-1->3->2 [15] 4/-1/-1->3->2
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Trees [0] 3/-1/-1->2->1 [1] 3/-1/-1->2->1 [2] 3/-1/-1->2->1 [3] 3/-1/-1->2->1 [4] 3/-1/-1->2->1 [5] 3/-1/-1->2->1 [6] 3/-1/-1->2->1 [7] 3/-1/-1->2->1 [8] 3/-1/-1->2->1 [9] 3/-1/-1->2->1 [10] 3/-1/-1->2->1 [11] 3/-1/-1->2->1 [12] 3/-1/-1->2->1 [13] 3/-1/-1->2->1 [14] 3/-1/-1->2->1 [15] 3/-1/-1->2->1
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 00/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 01/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Trees [0] 5/-1/-1->4->3 [1] 5/-1/-1->4->3 [2] 5/-1/-1->4->3 [3] 5/-1/-1->4->3 [4] 5/-1/-1->4->3 [5] 5/-1/-1->4->3 [6] 5/-1/-1->4->3 [7] 5/-1/-1->4->3 [8] 5/-1/-1->4->3 [9] 5/-1/-1->4->3 [10] 5/-1/-1->4->3 [11] 5/-1/-1->4->3 [12] 5/-1/-1->4->3 [13] 5/-1/-1->4->3 [14] 5/-1/-1->4->3 [15] 5/-1/-1->4->3
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 02/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Trees [0] 2/-1/-1->1->0 [1] 2/-1/-1->1->0 [2] 2/-1/-1->1->0 [3] 2/-1/-1->1->0 [4] 2/-1/-1->1->0 [5] 2/-1/-1->1->0 [6] 2/-1/-1->1->0 [7] 2/-1/-1->1->0 [8] 2/-1/-1->1->0 [9] 2/-1/-1->1->0 [10] 2/-1/-1->1->0 [11] 2/-1/-1->1->0 [12] 2/-1/-1->1->0 [13] 2/-1/-1->1->0 [14] 2/-1/-1->1->0 [15] 2/-1/-1->1->0
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 03/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 04/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Trees [0] 7/-1/-1->6->5 [1] 7/-1/-1->6->5 [2] 7/-1/-1->6->5 [3] 7/-1/-1->6->5 [4] 7/-1/-1->6->5 [5] 7/-1/-1->6->5 [6] 7/-1/-1->6->5 [7] 7/-1/-1->6->5 [8] 7/-1/-1->6->5 [9] 7/-1/-1->6->5 [10] 7/-1/-1->6->5 [11] 7/-1/-1->6->5 [12] 7/-1/-1->6->5 [13] 7/-1/-1->6->5 [14] 7/-1/-1->6->5 [15] 7/-1/-1->6->5
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 05/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Trees [0] 6/-1/-1->5->4 [1] 6/-1/-1->5->4 [2] 6/-1/-1->5->4 [3] 6/-1/-1->5->4 [4] 6/-1/-1->5->4 [5] 6/-1/-1->5->4 [6] 6/-1/-1->5->4 [7] 6/-1/-1->5->4 [8] 6/-1/-1->5->4 [9] 6/-1/-1->5->4 [10] 6/-1/-1->5->4 [11] 6/-1/-1->5->4 [12] 6/-1/-1->5->4 [13] 6/-1/-1->5->4 [14] 6/-1/-1->5->4 [15] 6/-1/-1->5->4
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Trees [0] -1/-1/-1->7->6 [1] -1/-1/-1->7->6 [2] -1/-1/-1->7->6 [3] -1/-1/-1->7->6 [4] -1/-1/-1->7->6 [5] -1/-1/-1->7->6 [6] -1/-1/-1->7->6 [7] -1/-1/-1->7->6 [8] -1/-1/-1->7->6 [9] -1/-1/-1->7->6 [10] -1/-1/-1->7->6 [11] -1/-1/-1->7->6 [12] -1/-1/-1->7->6 [13] -1/-1/-1->7->6 [14] -1/-1/-1->7->6 [15] -1/-1/-1->7->6
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 06/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 07/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 08/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 09/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 10/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 11/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 12/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 13/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 14/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 15/16 : 0 1 2 3 4 5 6 7
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO P2P Chunksize set to 524288
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 00/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 00/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 00/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 01/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 01/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 01/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 02/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 02/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 02/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 03/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 03/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 03/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 04/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 04/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 04/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 05/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 05/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 05/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 06/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 06/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 06/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 07/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 07/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 07/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 7[e4000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 08/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 08/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 08/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 09/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 09/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 09/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 10/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 10/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 10/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 11/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 11/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 11/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 12/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 12/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 13/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 13/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 12/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 14/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 13/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 14/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 15/0 : 2[29000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 14/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Channel 15/0 : 0[d000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 15/0 : 1[13000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 00/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Connected all rings
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 01/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 02/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 03/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 04/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 05/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 06/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 07/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 08/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 09/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 10/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 00/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 01/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 11/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 02/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 12/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 03/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 04/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 05/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 06/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 07/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 08/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 00/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 09/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 13/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 01/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 10/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 14/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 00/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 02/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 11/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Channel 15/0 : 7[e4000] -> 6[e0000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 03/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 01/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 12/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 00/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 00/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 00/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 13/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 01/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 04/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 01/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 01/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 02/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 02/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 14/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 05/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 02/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 02/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 03/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 03/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 03/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 04/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 04/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 04/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Channel 15/0 : 3[2d000] -> 2[29000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 03/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 06/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 05/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 05/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 05/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 07/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 04/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 06/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 06/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 07/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 07/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 05/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 08/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 06/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 08/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 08/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 06/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 09/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 07/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 09/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 10/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 11/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 09/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 10/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 12/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 10/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 11/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 08/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 07/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 11/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 12/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 09/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 08/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 13/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 14/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Channel 15/0 : 5[ca000] -> 4[c5000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 09/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 10/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 13/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 12/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 10/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 14/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 11/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 13/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 14/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Channel 15/0 : 1[13000] -> 0[d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Channel 15/0 : 6[e0000] -> 5[ca000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 11/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 12/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 12/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 13/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 13/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 14/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Channel 15/0 : 2[29000] -> 1[13000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 14/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Channel 15/0 : 4[c5000] -> 3[2d000] via P2P/IPC/read
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO Connected all trees
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO 16 coll channels, 0 nvls channels, 16 p2p channels, 16 p2p channels per peer
t-20240716144653-fl9g6-worker-0:82641:85578 [0] NCCL INFO comm 0x7fd950f05e60 rank 0 nranks 8 cudaDev 0 busId d000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82644:85580 [3] NCCL INFO comm 0x7fe6d0f6dde0 rank 3 nranks 8 cudaDev 3 busId 2d000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82648:85582 [7] NCCL INFO comm 0x7f5044f8a470 rank 7 nranks 8 cudaDev 7 busId e4000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82645:85581 [4] NCCL INFO comm 0x7f5094f7f020 rank 4 nranks 8 cudaDev 4 busId c5000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82647:85585 [6] NCCL INFO comm 0x7fe4fe9dade0 rank 6 nranks 8 cudaDev 6 busId e0000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82646:85584 [5] NCCL INFO comm 0x7f13b2b073b0 rank 5 nranks 8 cudaDev 5 busId ca000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82643:85579 [2] NCCL INFO comm 0x7f8cdce5c400 rank 2 nranks 8 cudaDev 2 busId 29000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
t-20240716144653-fl9g6-worker-0:82642:85583 [1] NCCL INFO comm 0x7f1828f863c0 rank 1 nranks 8 cudaDev 1 busId 13000 commId 0x8c8fbbd99bae4c61 - Init COMPLETE
{'loss': 1.803, 'grad_norm': 0.9341992598686405, 'learning_rate': 2.5e-05, 'epoch': 0.25}
{'loss': 1.0746, 'grad_norm': 0.6323953977964472, 'learning_rate': 5e-05, 'epoch': 0.5}
{'loss': 0.6122, 'grad_norm': 0.5774023152680935, 'learning_rate': 4.665063509461097e-05, 'epoch': 0.75}
{'loss': 0.4901, 'grad_norm': 1.1125529093008761, 'learning_rate': 3.7500000000000003e-05, 'epoch': 1.0}
{'loss': 0.4955, 'grad_norm': 0.44578440177535733, 'learning_rate': 2.5e-05, 'epoch': 1.25}
{'loss': 0.4732, 'grad_norm': 1.091411501996564, 'learning_rate': 1.2500000000000006e-05, 'epoch': 1.5}
{'loss': 0.4394, 'grad_norm': 1.2150014463998866, 'learning_rate': 3.3493649053890326e-06, 'epoch': 1.75}
{'loss': 0.4332, 'grad_norm': 0.6264092933696136, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 429.1717, 'train_samples_per_second': 2.95, 'train_steps_per_second': 0.186, 'train_loss': 0.7276630520820617, 'epoch': 2.0}
***** train metrics *****
epoch = 2.0
total_flos = 25712GF
train_loss = 0.7277
train_runtime = 0:07:09.17
train_samples_per_second = 2.95
train_steps_per_second = 0.186
Figure saved at: /ML-A100/team/mm/eamon/self_instruction/seed_ppl/qwen32B_models/qwen_32B_d4_iter5_model/training_loss.png
07/16/2024 10:31:10 - WARNING - llmtuner.extras.ploting - No metric eval_loss to plot.
***** eval metrics *****
epoch = 2.0
eval_loss = 0.4399
eval_runtime = 0:00:08.97
eval_samples_per_second = 7.915
eval_steps_per_second = 1.003
[2024-07-16 10:31:21,251] [INFO] [launch.py:347:main] Process 82643 exits successfully.
[2024-07-16 10:31:22,252] [INFO] [launch.py:347:main] Process 82644 exits successfully.
[2024-07-16 10:31:22,253] [INFO] [launch.py:347:main] Process 82646 exits successfully.
[2024-07-16 10:31:22,253] [INFO] [launch.py:347:main] Process 82642 exits successfully.
[2024-07-16 10:31:22,253] [INFO] [launch.py:347:main] Process 82645 exits successfully.
[2024-07-16 10:31:22,253] [INFO] [launch.py:347:main] Process 82647 exits successfully.
[2024-07-16 10:31:22,253] [INFO] [launch.py:347:main] Process 82648 exits successfully.
[2024-07-16 10:31:22,254] [INFO] [launch.py:347:main] Process 82641 exits successfully.
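
Note on the logged learning rates: the eight `learning_rate` values above (epochs 0.25 to 2.0) are consistent with linear warmup followed by cosine decay to zero, matching `--lr_scheduler_type cosine`, `--warmup_steps 20`, `--logging_steps 10` and `--learning_rate 5e-5` from the launch command at the top of this log. The sketch below recomputes them. The function `cosine_lr_with_warmup` is a hypothetical helper written for this check, not the trainer's own code, and the total of 80 optimizer steps is an assumption inferred from the log (8 log lines at 10 steps each, and 0.186 steps/s over about 429 s), since the log never prints it.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Values taken from the launch command.
BASE_LR = 5e-5
WARMUP_STEPS = 20
LOGGING_STEPS = 10
# Assumption: inferred from the log, not stated in it.
TOTAL_STEPS = 80

# Learning rate at each logging step (10, 20, ..., 80).
schedule = [
    cosine_lr_with_warmup(step, BASE_LR, WARMUP_STEPS, TOTAL_STEPS)
    for step in range(LOGGING_STEPS, TOTAL_STEPS + 1, LOGGING_STEPS)
]

for step, lr in zip(range(LOGGING_STEPS, TOTAL_STEPS + 1, LOGGING_STEPS), schedule):
    print(f"step {step:3d}  lr {lr:.6e}")
```

Under these assumptions the peak of 5e-05 falls at step 20 (epoch 0.5) and the rate reaches 0.0 at step 80 (epoch 2.0), which is the shape of the logged sequence.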