vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1447, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0887, device='cuda:0', grad_fn=) [2024-06-18 22:28:28,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:28:28,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.24 | bwd_microstep: 1932.49 | bwd_inner_microstep: 1927.03 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.40 [2024-06-18 22:28:28,983] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5263.44 | bwd: 2773.33 | bwd_inner: 2762.70 | bwd_allreduce: 10.45 | step: 61.48 12%|█▏ | 73/600 [13:22<1:29:15, 10.16s/it] {'loss': 0.6327, 'learning_rate': 9.781260668967628e-05, 'epoch': 0.73} 12%|█▏ | 73/600 [13:22<1:29:15, 10.16s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1870, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1382, device='cuda:0', grad_fn=) [2024-06-18 22:28:34,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.42 | bwd_microstep: 1921.67 | bwd_inner_microstep: 1916.72 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8313, device='cuda:0', grad_fn=) tensor(0.6998, 
device='cuda:0', grad_fn=) tensor(0.8181, device='cuda:0', grad_fn=) [2024-06-18 22:28:39,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:28:39,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2877.28 | bwd_microstep: 1683.12 | bwd_inner_microstep: 1677.41 | bwd_allreduce_microstep: 5.59 | step_microstep: 61.88 [2024-06-18 22:28:39,286] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6439.69 | bwd: 3604.78 | bwd_inner: 3594.15 | bwd_allreduce: 10.43 | step: 61.96 12%|█▏ | 74/600 [13:32<1:29:27, 10.20s/it] {'loss': 0.9782, 'learning_rate': 9.773295402873026e-05, 'epoch': 0.74} 12%|█▏ | 74/600 [13:32<1:29:27, 10.20s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4081, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3373, device='cuda:0', grad_fn=) [2024-06-18 22:28:44,881] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.90 | bwd_microstep: 1929.52 | bwd_inner_microstep: 1924.46 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0339, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.0008, device='cuda:0', grad_fn=) [2024-06-18 22:28:50,492] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:28:50,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.10 | bwd_microstep: 1897.97 | bwd_inner_microstep: 1892.58 | bwd_allreduce_microstep: 5.31 | step_microstep: 61.36 [2024-06-18 22:28:50,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7114.96 | bwd: 3827.48 | bwd_inner: 3817.07 | bwd_allreduce: 10.24 | step: 61.45 12%|█▎ | 75/600 [13:43<1:31:55, 10.50s/it] {'loss': 1.169, 'learning_rate': 9.765191054744305e-05, 'epoch': 0.75} 12%|█▎ | 75/600 [13:43<1:31:55, 10.50s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0511, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1275, device='cuda:0', grad_fn=) [2024-06-18 22:28:55,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.19 | bwd_microstep: 1739.00 | bwd_inner_microstep: 1734.07 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9450, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9205, device='cuda:0', grad_fn=) [2024-06-18 22:29:01,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:29:01,411] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.60 | bwd_microstep: 1896.90 | bwd_inner_microstep: 1891.21 | bwd_allreduce_microstep: 5.52 | step_microstep: 61.86 [2024-06-18 22:29:01,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7030.77 | bwd: 3635.89 | bwd_inner: 3625.33 | bwd_allreduce: 10.35 | step: 61.95 13%|█▎ | 76/600 [13:54<1:32:49, 10.63s/it] {'loss': 0.524, 'learning_rate': 9.756947860722143e-05, 'epoch': 0.76} 13%|█▎ | 76/600 [13:54<1:32:49, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8368, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8231, device='cuda:0', grad_fn=) [2024-06-18 22:29:07,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.39 | bwd_microstep: 1962.88 | bwd_inner_microstep: 1957.80 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0345, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0011, device='cuda:0', grad_fn=) [2024-06-18 22:29:11,478] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:29:11,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2650.06 | bwd_microstep: 1616.44 | 
bwd_inner_microstep: 1610.97 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.11 [2024-06-18 22:29:11,479] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6231.42 | bwd: 3579.31 | bwd_inner: 3568.78 | bwd_allreduce: 10.34 | step: 61.21 13%|█▎ | 77/600 [14:04<1:31:10, 10.46s/it] {'loss': 0.9121, 'learning_rate': 9.748566060992847e-05, 'epoch': 0.77} 13%|█▎ | 77/600 [14:04<1:31:10, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0662, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1406, device='cuda:0', grad_fn=) [2024-06-18 22:29:16,789] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3478.48 | bwd_microstep: 1738.71 | bwd_inner_microstep: 1733.49 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.5025, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(1.4226, device='cuda:0', grad_fn=) [2024-06-18 22:29:22,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:29:22,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.17 | bwd_microstep: 1925.52 | bwd_inner_microstep: 1919.90 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.89 [2024-06-18 22:29:22,443] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7042.63 | bwd: 3664.23 | bwd_inner: 3653.47 | bwd_allreduce: 10.49 | step: 61.98 13%|█▎ | 78/600 [14:15<1:32:19, 10.61s/it] {'loss': 0.7816, 'learning_rate': 9.740045899781352e-05, 'epoch': 0.78} 13%|█▎ | 78/600 [14:15<1:32:19, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0759, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1498, device='cuda:0', grad_fn=) [2024-06-18 22:29:27,006] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2840.71 | bwd_microstep: 1631.83 | bwd_inner_microstep: 1626.83 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0223, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1015, device='cuda:0', grad_fn=) [2024-06-18 22:29:32,468] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:29:32,469] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3501.16 | bwd_microstep: 1808.90 | bwd_inner_microstep: 1803.21 | bwd_allreduce_microstep: 5.57 | step_microstep: 62.38 [2024-06-18 22:29:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6341.85 | bwd: 3440.72 | bwd_inner: 3430.06 | 
bwd_allreduce: 10.46 | step: 62.46 13%|█▎ | 79/600 [14:25<1:30:37, 10.44s/it] {'loss': 0.1256, 'learning_rate': 9.731387625344104e-05, 'epoch': 0.79} 13%|█▎ | 79/600 [14:25<1:30:37, 10.44s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1956, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.1456, device='cuda:0', grad_fn=) [2024-06-18 22:29:37,455] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3171.77 | bwd_microstep: 1717.39 | bwd_inner_microstep: 1712.40 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2406, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1750, device='cuda:0', grad_fn=) [2024-06-18 22:29:43,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:29:43,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.51 | bwd_microstep: 1972.11 | bwd_inner_microstep: 1966.60 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.38 [2024-06-18 22:29:43,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6757.27 | bwd: 3689.49 | bwd_inner: 3679.04 | bwd_allreduce: 10.23 | step: 61.45 13%|█▎ | 80/600 [14:36<1:31:09, 10.52s/it] {'loss': 1.1603, 
'learning_rate': 9.722591489961827e-05, 'epoch': 0.8} 13%|█▎ | 80/600 [14:36<1:31:09, 10.52s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2464, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1918, device='cuda:0', grad_fn=) [2024-06-18 22:29:48,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.61 | bwd_microstep: 1928.10 | bwd_inner_microstep: 1922.95 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2047, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1427, device='cuda:0', grad_fn=) [2024-06-18 22:29:54,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:29:54,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.05 | bwd_microstep: 1931.38 | bwd_inner_microstep: 1925.81 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.53 [2024-06-18 22:29:54,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7128.65 | bwd: 3859.48 | bwd_inner: 3848.84 | bwd_allreduce: 10.36 | step: 61.68 14%|█▎ | 81/600 [14:47<1:32:53, 10.74s/it] {'loss': 1.1672, 'learning_rate': 9.713657749932172e-05, 'epoch': 0.81} 14%|█▎ | 81/600 [14:47<1:32:53, 10.74s/it]warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0629, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1381, device='cuda:0', grad_fn=) [2024-06-18 22:29:59,835] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3501.66 | bwd_microstep: 1806.02 | bwd_inner_microstep: 1801.07 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3947, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(1.3257, device='cuda:0', grad_fn=) [2024-06-18 22:30:05,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:30:05,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.15 | bwd_microstep: 1891.32 | bwd_inner_microstep: 1885.68 | bwd_allreduce_microstep: 5.52 | step_microstep: 61.63 [2024-06-18 22:30:05,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7050.79 | bwd: 3697.33 | bwd_inner: 3686.75 | bwd_allreduce: 10.40 | step: 61.71 14%|█▎ | 82/600 [14:58<1:33:23, 10.82s/it] {'loss': 0.7319, 'learning_rate': 9.70458666556225e-05, 'epoch': 0.82} 14%|█▎ | 82/600 [14:58<1:33:23, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7386, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7347, device='cuda:0', grad_fn=) [2024-06-18 22:30:11,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.63 | bwd_microstep: 1916.04 | bwd_inner_microstep: 1911.01 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.0013, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9597, device='cuda:0', grad_fn=) [2024-06-18 22:30:15,947] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:30:15,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2946.36 | bwd_microstep: 1835.01 | bwd_inner_microstep: 1829.51 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.58 [2024-06-18 22:30:15,948] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6500.96 | bwd: 3751.05 | bwd_inner: 3740.57 | bwd_allreduce: 10.27 | step: 61.66 14%|█▍ | 83/600 [15:09<1:32:25, 10.73s/it] {'loss': 0.8472, 'learning_rate': 9.695378501161045e-05, 'epoch': 0.83} 14%|█▍ | 83/600 [15:09<1:32:25, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size 
of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0595, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0235, device='cuda:0', grad_fn=) [2024-06-18 22:30:21,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.28 | bwd_microstep: 1899.69 | bwd_inner_microstep: 1894.54 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1469, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.0904, device='cuda:0', grad_fn=) [2024-06-18 22:30:27,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:30:27,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3522.48 | bwd_microstep: 1849.00 | bwd_inner_microstep: 1843.55 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.93 [2024-06-18 22:30:27,037] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7077.73 | bwd: 3748.68 | bwd_inner: 3738.07 | bwd_allreduce: 10.44 | step: 62.02 14%|█▍ | 84/600 [15:20<1:33:10, 10.84s/it] {'loss': 1.0569, 'learning_rate': 9.686033525031719e-05, 'epoch': 0.84} 14%|█▍ | 84/600 [15:20<1:33:10, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0597, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1352, device='cuda:0', grad_fn=) [2024-06-18 22:30:32,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.80 | bwd_microstep: 1740.31 | bwd_inner_microstep: 1735.18 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3800, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3120, device='cuda:0', grad_fn=) [2024-06-18 22:30:37,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:30:37,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.63 | bwd_microstep: 1887.07 | bwd_inner_microstep: 1881.52 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.33 [2024-06-18 22:30:37,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7025.41 | bwd: 3627.36 | bwd_inner: 3616.79 | bwd_allreduce: 10.29 | step: 61.41 14%|█▍ | 85/600 [15:31<1:33:10, 10.85s/it] {'loss': 0.7236, 'learning_rate': 9.676552009463783e-05, 'epoch': 0.85} 14%|█▍ | 85/600 [15:31<1:33:10, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9669, 
device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9402, device='cuda:0', grad_fn=) [2024-06-18 22:30:43,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.42 | bwd_microstep: 1961.06 | bwd_inner_microstep: 1956.04 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1088, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0564, device='cuda:0', grad_fn=) [2024-06-18 22:30:49,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:30:49,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.82 | bwd_microstep: 1927.66 | bwd_inner_microstep: 1922.12 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.19 [2024-06-18 22:30:49,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7143.22 | bwd: 3888.71 | bwd_inner: 3878.15 | bwd_allreduce: 10.38 | step: 61.27 14%|█▍ | 86/600 [15:42<1:34:08, 10.99s/it] {'loss': 0.9983, 'learning_rate': 9.66693423072518e-05, 'epoch': 0.86} 14%|█▍ | 86/600 [15:42<1:34:08, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0286, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1073, device='cuda:0', grad_fn=) 
[2024-06-18 22:30:53,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2839.89 | bwd_microstep: 1629.49 | bwd_inner_microstep: 1624.48 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1357, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0806, device='cuda:0', grad_fn=) [2024-06-18 22:30:59,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:30:59,473] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.45 | bwd_microstep: 1941.56 | bwd_inner_microstep: 1936.04 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.12 [2024-06-18 22:30:59,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6410.33 | bwd: 3571.04 | bwd_inner: 3560.57 | bwd_allreduce: 10.27 | step: 61.20 14%|█▍ | 87/600 [15:52<1:32:01, 10.76s/it] {'loss': 0.5939, 'learning_rate': 9.657180469054213e-05, 'epoch': 0.87} 14%|█▍ | 87/600 [15:52<1:32:01, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0498, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1263, device='cuda:0', grad_fn=) [2024-06-18 22:31:04,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3415.55 | 
bwd_microstep: 1639.33 | bwd_inner_microstep: 1634.29 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2372, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.1722, device='cuda:0', grad_fn=) [2024-06-18 22:31:10,272] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:31:10,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.32 | bwd_microstep: 1924.62 | bwd_inner_microstep: 1918.92 | bwd_allreduce_microstep: 5.58 | step_microstep: 61.54 [2024-06-18 22:31:10,273] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6983.86 | bwd: 3563.94 | bwd_inner: 3553.27 | bwd_allreduce: 10.44 | step: 61.63 15%|█▍ | 88/600 [16:03<1:31:56, 10.77s/it] {'loss': 0.6493, 'learning_rate': 9.647291008651398e-05, 'epoch': 0.88} 15%|█▍ | 88/600 [16:03<1:31:56, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0221, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1014, device='cuda:0', grad_fn=) [2024-06-18 22:31:15,677] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3498.42 | bwd_microstep: 1810.81 | bwd_inner_microstep: 1805.80 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.1216, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0679, device='cuda:0', grad_fn=) [2024-06-18 22:31:20,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:31:20,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2781.14 | bwd_microstep: 1880.40 | bwd_inner_microstep: 1874.87 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.52 [2024-06-18 22:31:20,504] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6279.51 | bwd: 3691.20 | bwd_inner: 3680.71 | bwd_allreduce: 10.30 | step: 61.61 15%|█▍ | 89/600 [16:13<1:30:22, 10.61s/it] {'loss': 0.5846, 'learning_rate': 9.637266137671177e-05, 'epoch': 0.89} 15%|█▍ | 89/600 [16:13<1:30:22, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3925, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3232, device='cuda:0', grad_fn=) [2024-06-18 22:31:26,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3583.22 | bwd_microstep: 1963.89 | bwd_inner_microstep: 1958.84 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0412, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1186, device='cuda:0', grad_fn=) [2024-06-18 22:31:31,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:31:31,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3469.50 | bwd_microstep: 1726.13 | bwd_inner_microstep: 1720.49 | bwd_allreduce_microstep: 5.46 | step_microstep: 62.02 [2024-06-18 22:31:31,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.70 | bwd: 3690.02 | bwd_inner: 3679.42 | bwd_allreduce: 10.34 | step: 62.10 15%|█▌ | 90/600 [16:24<1:31:10, 10.73s/it] {'loss': 0.7209, 'learning_rate': 9.627106148213522e-05, 'epoch': 0.9} 15%|█▌ | 90/600 [16:24<1:31:10, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3394, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2755, device='cuda:0', grad_fn=) [2024-06-18 22:31:37,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.87 | bwd_microstep: 1886.57 | bwd_inner_microstep: 1881.51 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size 
of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0668, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0185, device='cuda:0', grad_fn=) [2024-06-18 22:31:42,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:31:42,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.67 | bwd_microstep: 1939.48 | bwd_inner_microstep: 1933.85 | bwd_allreduce_microstep: 5.47 | step_microstep: 61.33 [2024-06-18 22:31:42,710] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7117.51 | bwd: 3826.04 | bwd_inner: 3815.45 | bwd_allreduce: 10.31 | step: 61.42 15%|█▌ | 91/600 [16:35<1:32:13, 10.87s/it] {'loss': 1.147, 'learning_rate': 9.61681133631542e-05, 'epoch': 0.91} 15%|█▌ | 91/600 [16:35<1:32:13, 10.87s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3381, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(1.2747, device='cuda:0', grad_fn=) [2024-06-18 22:31:48,256] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.64 | bwd_microstep: 1894.51 | bwd_inner_microstep: 1889.46 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1686, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.1105, device='cuda:0', grad_fn=) [2024-06-18 22:31:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:31:54,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.74 | bwd_microstep: 1981.60 | bwd_inner_microstep: 1976.02 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.32 [2024-06-18 22:31:54,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7145.36 | bwd: 3876.10 | bwd_inner: 3865.57 | bwd_allreduce: 10.31 | step: 61.40 15%|█▌ | 92/600 [16:47<1:33:06, 11.00s/it] {'loss': 1.1926, 'learning_rate': 9.606382001942255e-05, 'epoch': 0.92} 15%|█▌ | 92/600 [16:47<1:33:06, 11.00s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0331, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1112, device='cuda:0', grad_fn=) [2024-06-18 22:31:56,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1699.46 | bwd_microstep: 820.71 | bwd_inner_microstep: 815.86 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.3194, 
device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.2459, device='cuda:0', grad_fn=) [2024-06-18 22:32:01,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:32:01,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2794.01 | bwd_microstep: 1878.73 | bwd_inner_microstep: 1873.05 | bwd_allreduce_microstep: 5.54 | step_microstep: 61.95 [2024-06-18 22:32:01,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4493.46 | bwd: 2699.43 | bwd_inner: 2688.93 | bwd_allreduce: 10.28 | step: 62.02 16%|█▌ | 93/600 [16:54<1:23:51, 9.92s/it] {'loss': 0.6786, 'learning_rate': 9.595818448979061e-05, 'epoch': 0.93} 16%|█▌ | 93/600 [16:54<1:23:51, 9.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3731, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3058, device='cuda:0', grad_fn=) [2024-06-18 22:32:06,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.59 | bwd_microstep: 1887.70 | bwd_inner_microstep: 1882.75 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0349, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1129, device='cuda:0', grad_fn=) 
[2024-06-18 22:32:11,645] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:32:11,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2867.84 | bwd_microstep: 1675.20 | bwd_inner_microstep: 1669.67 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.31 [2024-06-18 22:32:11,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6412.41 | bwd: 3562.89 | bwd_inner: 3552.45 | bwd_allreduce: 10.28 | step: 61.39 16%|█▌ | 94/600 [17:04<1:24:27, 10.01s/it] {'loss': 0.7093, 'learning_rate': 9.585120985221671e-05, 'epoch': 0.94} 16%|█▌ | 94/600 [17:04<1:24:27, 10.01s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0263, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1056, device='cuda:0', grad_fn=) [2024-06-18 22:32:17,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.45 | bwd_microstep: 1804.07 | bwd_inner_microstep: 1798.89 | bwd_allreduce_microstep: 5.07 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9988, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9574, device='cuda:0', grad_fn=) [2024-06-18 22:32:22,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 
[2024-06-18 22:32:22,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.56 | bwd_microstep: 1927.51 | bwd_inner_microstep: 1921.91 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.72 [2024-06-18 22:32:22,689] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.97 | bwd: 3731.58 | bwd_inner: 3720.86 | bwd_allreduce: 10.48 | step: 61.86 16%|█▌ | 95/600 [17:15<1:26:53, 10.32s/it] {'loss': 0.5315, 'learning_rate': 9.574289922367749e-05, 'epoch': 0.95} 16%|█▌ | 95/600 [17:15<1:26:53, 10.32s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8429, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8285, device='cuda:0', grad_fn=) [2024-06-18 22:32:28,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.52 | bwd_microstep: 1918.79 | bwd_inner_microstep: 1913.85 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6527, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6574, device='cuda:0', grad_fn=) [2024-06-18 22:32:33,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:32:33,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.30 | 
bwd_microstep: 1920.82 | bwd_inner_microstep: 1915.30 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.16 [2024-06-18 22:32:33,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7117.80 | bwd: 3839.61 | bwd_inner: 3829.20 | bwd_allreduce: 10.23 | step: 61.24 16%|█▌ | 96/600 [17:27<1:28:57, 10.59s/it] {'loss': 0.743, 'learning_rate': 9.563325576007701e-05, 'epoch': 0.96} 16%|█▌ | 96/600 [17:27<1:28:57, 10.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9689, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9420, device='cuda:0', grad_fn=) [2024-06-18 22:32:39,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.75 | bwd_microstep: 1908.42 | bwd_inner_microstep: 1903.14 | bwd_allreduce_microstep: 5.16 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0930, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0422, device='cuda:0', grad_fn=) [2024-06-18 22:32:45,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:32:45,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.32 | bwd_microstep: 1931.66 | bwd_inner_microstep: 1926.06 | bwd_allreduce_microstep: 5.48 | step_microstep: 63.67 
[2024-06-18 22:32:45,136] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7123.06 | bwd: 3840.07 | bwd_inner: 3829.22 | bwd_allreduce: 10.65 | step: 63.76 16%|█▌ | 97/600 [17:38<1:30:23, 10.78s/it] {'loss': 0.9921, 'learning_rate': 9.552228265615492e-05, 'epoch': 0.97} 16%|█▌ | 97/600 [17:38<1:30:23, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(1.5627, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.4764, device='cuda:0', grad_fn=) [2024-06-18 22:32:50,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3278.71 | bwd_microstep: 1924.51 | bwd_inner_microstep: 1919.46 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0383, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1159, device='cuda:0', grad_fn=) [2024-06-18 22:32:55,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:32:55,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3509.14 | bwd_microstep: 1807.37 | bwd_inner_microstep: 1801.69 | bwd_allreduce_microstep: 5.58 | step_microstep: 61.94 [2024-06-18 22:32:55,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6787.83 | bwd: 3731.86 | 
bwd_inner: 3721.15 | bwd_allreduce: 10.52 | step: 62.02 16%|█▋ | 98/600 [17:49<1:30:12, 10.78s/it] {'loss': 0.7962, 'learning_rate': 9.540998314539328e-05, 'epoch': 0.98} 16%|█▋ | 98/600 [17:49<1:30:12, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0341, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1122, device='cuda:0', grad_fn=) [2024-06-18 22:33:01,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.01 | bwd_microstep: 1729.13 | bwd_inner_microstep: 1724.14 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.5473, device='cuda:0', grad_fn=) tensor(0.5782, device='cuda:0', grad_fn=) tensor(1.4504, device='cuda:0', grad_fn=) [2024-06-18 22:33:06,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:33:06,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2949.15 | bwd_microstep: 1843.31 | bwd_inner_microstep: 1837.79 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.64 [2024-06-18 22:33:06,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6424.14 | bwd: 3572.44 | bwd_inner: 3561.92 | bwd_allreduce: 10.33 | step: 61.73 16%|█▋ | 99/600 [17:59<1:28:41, 10.62s/it] {'loss': 
0.7813, 'learning_rate': 9.529636049992234e-05, 'epoch': 0.99} 16%|█▋ | 99/600 [17:59<1:28:41, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1049, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(1.0640, device='cuda:0', grad_fn=) [2024-06-18 22:33:11,756] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.11 | bwd_microstep: 1922.78 | bwd_inner_microstep: 1917.76 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.14 please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9905, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9614, device='cuda:0', grad_fn=) [2024-06-18 22:33:18,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:33:18,236] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.59 | bwd_microstep: 1918.57 | bwd_inner_microstep: 1912.87 | bwd_allreduce_microstep: 5.59 | step_microstep: 62.32 [2024-06-18 22:33:18,237] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7118.66 | bwd: 3841.35 | bwd_inner: 3830.63 | bwd_allreduce: 10.56 | step: 62.46 17%|█▋ | 100/600 [18:11<1:32:08, 11.06s/it] {'loss': 1.0127, 'learning_rate': 9.518141803042527e-05, 'epoch': 1.0} 17%|█▋ | 100/600 [18:11<1:32:08, 11.06s/it][2024-06-18 22:33:20,923] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:33:26,889] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:33:32,769] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:33:38,641] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9167, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.8954, device='cuda:0', grad_fn=) [2024-06-18 22:33:47,754] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3596.26 | bwd_microstep: 1951.21 | bwd_inner_microstep: 1946.08 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1642, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(1.1175, device='cuda:0', grad_fn=) [2024-06-18 22:33:52,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:33:52,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2684.65 | bwd_microstep: 1646.05 | bwd_inner_microstep: 1640.41 | bwd_allreduce_microstep: 5.53 | step_microstep: 61.75 [2024-06-18 22:33:52,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6280.89 | bwd: 3597.26 | bwd_inner: 3586.51 | bwd_allreduce: 10.55 | step: 61.84 17%|█▋ | 101/600 [18:45<2:29:12, 17.94s/it] {'loss': 1.0064, 'learning_rate': 9.506515908604162e-05, 'epoch': 1.01} 17%|█▋ | 101/600 [18:45<2:29:12, 17.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3394, device='cuda:0', grad_fn=) tensor(0.5847, 
device='cuda:0', grad_fn=) tensor(1.2639, device='cuda:0', grad_fn=) [2024-06-18 22:33:58,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3651.48 | bwd_microstep: 2093.14 | bwd_inner_microstep: 2087.90 | bwd_allreduce_microstep: 5.12 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9937, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.9639, device='cuda:0', grad_fn=) [2024-06-18 22:34:03,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.99 [2024-06-18 22:34:03,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.21 | bwd_microstep: 1887.26 | bwd_inner_microstep: 1881.55 | bwd_allreduce_microstep: 5.60 | step_microstep: 62.46 [2024-06-18 22:34:03,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7207.68 | bwd: 3980.39 | bwd_inner: 3969.47 | bwd_allreduce: 10.73 | step: 62.55 17%|█▋ | 102/600 [18:56<2:12:46, 16.00s/it] {'loss': 1.1139, 'learning_rate': 9.494758705426978e-05, 'epoch': 1.02} 17%|█▋ | 102/600 [18:56<2:12:46, 16.00s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1501, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.2169, device='cuda:0', grad_fn=) [2024-06-18 22:34:09,094] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3499.40 | bwd_microstep: 1801.26 | bwd_inner_microstep: 1796.17 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9592, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9217, device='cuda:0', grad_fn=) [2024-06-18 22:34:14,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:34:14,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3596.20 | bwd_microstep: 1984.67 | bwd_inner_microstep: 1979.12 | bwd_allreduce_microstep: 5.44 | step_microstep: 62.05 [2024-06-18 22:34:14,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7095.56 | bwd: 3785.93 | bwd_inner: 3775.31 | bwd_allreduce: 10.42 | step: 62.13 17%|█▋ | 103/600 [19:08<2:00:27, 14.54s/it] {'loss': 0.5693, 'learning_rate': 9.482870536086823e-05, 'epoch': 1.03} 17%|█▋ | 103/600 [19:08<2:00:27, 14.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0338, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0004, device='cuda:0', grad_fn=) [2024-06-18 22:34:20,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.73 | bwd_microstep: 1916.72 | 
bwd_inner_microstep: 1911.56 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0437, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1208, device='cuda:0', grad_fn=) [2024-06-18 22:34:25,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:34:25,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3458.04 | bwd_microstep: 1693.53 | bwd_inner_microstep: 1687.93 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.23 [2024-06-18 22:34:25,734] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7023.75 | bwd: 3610.24 | bwd_inner: 3599.48 | bwd_allreduce: 10.55 | step: 62.32 17%|█▋ | 104/600 [19:18<1:51:08, 13.45s/it] {'loss': 0.5606, 'learning_rate': 9.470851746975582e-05, 'epoch': 1.04} 17%|█▋ | 104/600 [19:18<1:51:08, 13.45s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0144, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0948, device='cuda:0', grad_fn=) [2024-06-18 22:34:31,132] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3502.06 | bwd_microstep: 1803.32 | bwd_inner_microstep: 1798.22 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.08 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9351, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9116, device='cuda:0', grad_fn=) [2024-06-18 22:34:36,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:34:36,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.81 | bwd_microstep: 1911.18 | bwd_inner_microstep: 1905.56 | bwd_allreduce_microstep: 5.51 | step_microstep: 61.71 [2024-06-18 22:34:36,767] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7066.83 | bwd: 3714.50 | bwd_inner: 3703.84 | bwd_allreduce: 10.44 | step: 61.79 18%|█▊ | 105/600 [19:29<1:44:57, 12.72s/it] {'loss': 0.5032, 'learning_rate': 9.458702688291073e-05, 'epoch': 1.05} 18%|█▊ | 105/600 [19:29<1:44:57, 12.72s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0289, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1075, device='cuda:0', grad_fn=) [2024-06-18 22:34:41,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2848.05 | bwd_microstep: 1633.77 | bwd_inner_microstep: 1628.70 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1003, device='cuda:0', grad_fn=) tensor(0.5782, device='cuda:0', grad_fn=) tensor(1.0481, device='cuda:0', grad_fn=) [2024-06-18 22:34:47,019] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 22:34:47,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3582.20 | bwd_microstep: 1931.09 | bwd_inner_microstep: 1925.28 | bwd_allreduce_microstep: 5.69 | step_microstep: 62.99 [2024-06-18 22:34:47,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6430.24 | bwd: 3564.86 | bwd_inner: 3554.04 | bwd_allreduce: 10.61 | step: 63.07 18%|█▊ | 106/600 [19:40<1:38:38, 11.98s/it] {'loss': 0.5778, 'learning_rate': 9.446423714026846e-05, 'epoch': 1.06} 18%|█▊ | 106/600 [19:40<1:38:38, 11.98s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1559, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1103, device='cuda:0', grad_fn=) [2024-06-18 22:34:51,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2663.02 | bwd_microstep: 1610.56 | bwd_inner_microstep: 1605.08 | bwd_allreduce_microstep: 5.37 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(0.0255, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1049, device='cuda:0', grad_fn=) [2024-06-18 22:34:56,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 22:34:56,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3196.37 | bwd_microstep: 1771.06 | bwd_inner_microstep: 1765.49 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.73 [2024-06-18 22:34:56,509] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5859.36 | bwd: 3381.61 | bwd_inner: 3370.59 | bwd_allreduce: 10.83 | step: 61.81 18%|█▊ | 107/600 [19:49<1:32:18, 11.23s/it] {'loss': 0.6076, 'learning_rate': 9.434015181961873e-05, 'epoch': 1.07} 18%|█▊ | 107/600 [19:49<1:32:18, 11.23s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1070, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0663, device='cuda:0', grad_fn=) [2024-06-18 22:35:00,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2691.15 | bwd_microstep: 1658.19 | bwd_inner_microstep: 1653.15 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0153, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0952, device='cuda:0', grad_fn=) [2024-06-18 22:35:06,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:35:06,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3484.56 | bwd_microstep: 1739.38 | bwd_inner_microstep: 1733.80 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.98 [2024-06-18 22:35:06,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6175.68 | bwd: 3397.57 | bwd_inner: 3387.00 | bwd_allreduce: 10.34 | step: 62.06 18%|█▊ | 108/600 [19:59<1:28:38, 10.81s/it] {'loss': 0.5808, 'learning_rate': 9.421477453650118e-05, 'epoch': 1.08} 18%|█▊ | 108/600 [19:59<1:28:38, 10.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0320, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1103, device='cuda:0', grad_fn=) [2024-06-18 22:35:11,646] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3481.03 | bwd_microstep: 1740.90 | bwd_inner_microstep: 1735.76 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0388, 
device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0049, device='cuda:0', grad_fn=) [2024-06-18 22:35:17,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:35:17,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.20 | bwd_microstep: 1886.74 | bwd_inner_microstep: 1881.16 | bwd_allreduce_microstep: 5.47 | step_microstep: 61.73 [2024-06-18 22:35:17,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7041.22 | bwd: 3627.63 | bwd_inner: 3616.98 | bwd_allreduce: 10.43 | step: 61.81 18%|█▊ | 109/600 [20:10<1:28:43, 10.84s/it] {'loss': 0.5576, 'learning_rate': 9.408810894410009e-05, 'epoch': 1.09} 18%|█▊ | 109/600 [20:10<1:28:43, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1498, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0937, device='cuda:0', grad_fn=) [2024-06-18 22:35:22,818] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.53 | bwd_microstep: 1905.34 | bwd_inner_microstep: 1900.12 | bwd_allreduce_microstep: 5.07 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1274, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0731, device='cuda:0', 
grad_fn=) [2024-06-18 22:35:28,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:35:28,584] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3603.60 | bwd_microstep: 1990.10 | bwd_inner_microstep: 1984.54 | bwd_allreduce_microstep: 5.44 | step_microstep: 62.18 [2024-06-18 22:35:28,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7163.11 | bwd: 3895.43 | bwd_inner: 3884.72 | bwd_allreduce: 10.52 | step: 62.27 18%|█▊ | 110/600 [20:21<1:29:45, 10.99s/it] {'loss': 1.0834, 'learning_rate': 9.396015873313781e-05, 'epoch': 1.1} 18%|█▊ | 110/600 [20:21<1:29:45, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5806, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5925, device='cuda:0', grad_fn=) [2024-06-18 22:35:34,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.34 | bwd_microstep: 1920.77 | bwd_inner_microstep: 1915.62 | bwd_allreduce_microstep: 5.03 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1903, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1298, device='cuda:0', grad_fn=) [2024-06-18 22:35:39,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 
[2024-06-18 22:35:39,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3569.08 | bwd_microstep: 1906.14 | bwd_inner_microstep: 1900.26 | bwd_allreduce_microstep: 5.75 | step_microstep: 62.32 [2024-06-18 22:35:39,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.40 | bwd: 3826.90 | bwd_inner: 3815.90 | bwd_allreduce: 10.80 | step: 62.40 18%|█▊ | 111/600 [20:32<1:30:09, 11.06s/it] {'loss': 0.8611, 'learning_rate': 9.38309276317674e-05, 'epoch': 1.11} 18%|█▊ | 111/600 [20:32<1:30:09, 11.06s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0577, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0107, device='cuda:0', grad_fn=) [2024-06-18 22:35:45,387] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.85 | bwd_microstep: 1903.84 | bwd_inner_microstep: 1898.60 | bwd_allreduce_microstep: 5.14 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8431, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8288, device='cuda:0', grad_fn=) [2024-06-18 22:35:51,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:35:51,045] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3569.63 | 
bwd_microstep: 1924.76 | bwd_inner_microstep: 1919.13 | bwd_allreduce_microstep: 5.51 | step_microstep: 62.70 [2024-06-18 22:35:51,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7133.46 | bwd: 3828.60 | bwd_inner: 3817.74 | bwd_allreduce: 10.66 | step: 62.79 19%|█▊ | 112/600 [20:44<1:30:23, 11.11s/it] {'loss': 0.9198, 'learning_rate': 9.37004194054638e-05, 'epoch': 1.12} 19%|█▊ | 112/600 [20:44<1:30:23, 11.11s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0094, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0895, device='cuda:0', grad_fn=) [2024-06-18 22:35:56,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3506.80 | bwd_microstep: 1808.99 | bwd_inner_microstep: 1803.70 | bwd_allreduce_microstep: 5.18 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0289, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1071, device='cuda:0', grad_fn=) [2024-06-18 22:36:01,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:36:01,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2896.59 | bwd_microstep: 1741.56 | bwd_inner_microstep: 1735.94 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.87 
[2024-06-18 22:36:01,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6403.34 | bwd: 3550.55 | bwd_inner: 3539.70 | bwd_allreduce: 10.64 | step: 62.01 19%|█▉ | 113/600 [20:54<1:27:59, 10.84s/it] {'loss': 0.0983, 'learning_rate': 9.356863785691428e-05, 'epoch': 1.13} 19%|█▉ | 113/600 [20:54<1:27:59, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0172, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0969, device='cuda:0', grad_fn=) [2024-06-18 22:36:06,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3459.19 | bwd_microstep: 1693.45 | bwd_inner_microstep: 1688.46 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.9467, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9108, device='cuda:0', grad_fn=) [2024-06-18 22:36:11,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:36:11,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2956.23 | bwd_microstep: 1834.62 | bwd_inner_microstep: 1829.03 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.94 [2024-06-18 22:36:11,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6415.39 | bwd: 3528.07 | 
bwd_inner: 3517.49 | bwd_allreduce: 10.38 | step: 62.02 19%|█▉ | 114/600 [21:04<1:26:14, 10.65s/it] {'loss': 0.5039, 'learning_rate': 9.343558682590756e-05, 'epoch': 1.14} 19%|█▉ | 114/600 [21:04<1:26:14, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0248, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1038, device='cuda:0', grad_fn=) [2024-06-18 22:36:16,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3146.35 | bwd_microstep: 1661.10 | bwd_inner_microstep: 1655.76 | bwd_allreduce_microstep: 5.23 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9443, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9084, device='cuda:0', grad_fn=) [2024-06-18 22:36:22,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:36:22,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.33 | bwd_microstep: 1939.09 | bwd_inner_microstep: 1933.46 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.83 [2024-06-18 22:36:22,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6721.67 | bwd: 3600.18 | bwd_inner: 3589.27 | bwd_allreduce: 10.69 | step: 61.92 19%|█▉ | 115/600 [21:15<1:25:53, 10.63s/it] {'loss': 
0.5061, 'learning_rate': 9.330127018922194e-05, 'epoch': 1.15} 19%|█▉ | 115/600 [21:15<1:25:53, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0700, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0325, device='cuda:0', grad_fn=) [2024-06-18 22:36:26,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2851.09 | bwd_microstep: 1641.30 | bwd_inner_microstep: 1636.25 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0545, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(1.0187, device='cuda:0', grad_fn=) [2024-06-18 22:36:32,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:36:32,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3567.63 | bwd_microstep: 1922.55 | bwd_inner_microstep: 1916.77 | bwd_allreduce_microstep: 5.66 | step_microstep: 62.64 [2024-06-18 22:36:32,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6418.69 | bwd: 3563.85 | bwd_inner: 3553.05 | bwd_allreduce: 10.60 | step: 62.72 19%|█▉ | 116/600 [21:25<1:24:47, 10.51s/it] {'loss': 1.0256, 'learning_rate': 9.316569186051234e-05, 'epoch': 1.16} 19%|█▉ | 116/600 [21:25<1:24:47, 
10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8808, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8627, device='cuda:0', grad_fn=) [2024-06-18 22:36:37,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.19 | bwd_microstep: 1892.17 | bwd_inner_microstep: 1886.91 | bwd_allreduce_microstep: 5.15 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.0515, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0048, device='cuda:0', grad_fn=) [2024-06-18 22:36:42,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:36:42,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2797.52 | bwd_microstep: 1879.75 | bwd_inner_microstep: 1874.20 | bwd_allreduce_microstep: 5.44 | step_microstep: 62.08 [2024-06-18 22:36:42,661] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6352.69 | bwd: 3771.92 | bwd_inner: 3761.13 | bwd_allreduce: 10.60 | step: 62.18 20%|█▉ | 117/600 [21:35<1:24:19, 10.48s/it] {'loss': 0.9337, 'learning_rate': 9.302885579019627e-05, 'epoch': 1.17} 20%|█▉ | 117/600 [21:35<1:24:19, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0267, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1056, device='cuda:0', grad_fn=) [2024-06-18 22:36:47,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2711.54 | bwd_microstep: 1716.86 | bwd_inner_microstep: 1711.81 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0950, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0440, device='cuda:0', grad_fn=) [2024-06-18 22:36:52,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:36:52,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3596.73 | bwd_microstep: 1980.68 | bwd_inner_microstep: 1975.05 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.28 [2024-06-18 22:36:52,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6308.23 | bwd: 3697.54 | bwd_inner: 3686.88 | bwd_allreduce: 10.43 | step: 62.36 20%|█▉ | 118/600 [21:46<1:23:39, 10.41s/it] {'loss': 0.5748, 'learning_rate': 9.289076596533872e-05, 'epoch': 1.18} 20%|█▉ | 118/600 [21:46<1:23:39, 10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0129, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0931, device='cuda:0', grad_fn=) [2024-06-18 22:36:58,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3502.36 | bwd_microstep: 1806.16 | bwd_inner_microstep: 1801.02 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(1.1232, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0808, device='cuda:0', grad_fn=) [2024-06-18 22:37:03,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:37:03,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3156.48 | bwd_microstep: 1680.38 | bwd_inner_microstep: 1674.83 | bwd_allreduce_microstep: 5.37 | step_microstep: 62.05 [2024-06-18 22:37:03,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6658.81 | bwd: 3486.53 | bwd_inner: 3475.93 | bwd_allreduce: 10.38 | step: 62.15 20%|█▉ | 119/600 [21:56<1:23:26, 10.41s/it] {'loss': 0.587, 'learning_rate': 9.2751426409536e-05, 'epoch': 1.19} 20%|█▉ | 119/600 [21:56<1:23:26, 10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2705, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2134, device='cuda:0', grad_fn=) [2024-06-18 22:37:08,930] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.47 | bwd_microstep: 1926.27 | bwd_inner_microstep: 1921.23 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0192, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9872, device='cuda:0', grad_fn=) [2024-06-18 22:37:14,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 22:37:14,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.93 | bwd_microstep: 1892.60 | bwd_inner_microstep: 1886.72 | bwd_allreduce_microstep: 5.75 | step_microstep: 64.42 [2024-06-18 22:37:14,539] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7127.39 | bwd: 3818.86 | bwd_inner: 3807.96 | bwd_allreduce: 10.70 | step: 64.50 20%|██ | 120/600 [22:07<1:25:11, 10.65s/it] {'loss': 1.1003, 'learning_rate': 9.261084118279847e-05, 'epoch': 1.2} 20%|██ | 120/600 [22:07<1:25:11, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2590, 
device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.3146, device='cuda:0', grad_fn=) [2024-06-18 22:37:19,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3478.35 | bwd_microstep: 1728.00 | bwd_inner_microstep: 1720.98 | bwd_allreduce_microstep: 6.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9651, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9271, device='cuda:0', grad_fn=) [2024-06-18 22:37:25,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:37:25,563] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3591.61 | bwd_microstep: 1967.60 | bwd_inner_microstep: 1961.98 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.13 [2024-06-18 22:37:25,564] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7069.95 | bwd: 3695.59 | bwd_inner: 3683.01 | bwd_allreduce: 12.33 | step: 62.21 20%|██ | 121/600 [22:18<1:25:55, 10.76s/it] {'loss': 0.6208, 'learning_rate': 9.24690143814323e-05, 'epoch': 1.21} 20%|██ | 121/600 [22:18<1:25:55, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0390, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1166, device='cuda:0', 
grad_fn=) [2024-06-18 22:37:30,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3453.08 | bwd_microstep: 1692.88 | bwd_inner_microstep: 1687.84 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1159, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1850, device='cuda:0', grad_fn=) [2024-06-18 22:37:36,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:37:36,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3484.37 | bwd_microstep: 1738.78 | bwd_inner_microstep: 1733.02 | bwd_allreduce_microstep: 5.65 | step_microstep: 62.75 [2024-06-18 22:37:36,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6937.43 | bwd: 3431.65 | bwd_inner: 3420.87 | bwd_allreduce: 10.58 | step: 62.83 20%|██ | 122/600 [22:29<1:25:22, 10.72s/it] {'loss': 0.1508, 'learning_rate': 9.232595013792002e-05, 'epoch': 1.22} 20%|██ | 122/600 [22:29<1:25:22, 10.72s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0118, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0921, device='cuda:0', grad_fn=) [2024-06-18 22:37:40,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
2897.09 | bwd_microstep: 1736.62 | bwd_inner_microstep: 1731.51 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9924, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9520, device='cuda:0', grad_fn=) [2024-06-18 22:37:46,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:37:46,570] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3569.08 | bwd_microstep: 1933.49 | bwd_inner_microstep: 1927.85 | bwd_allreduce_microstep: 5.53 | step_microstep: 61.78 [2024-06-18 22:37:46,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6466.15 | bwd: 3670.10 | bwd_inner: 3659.42 | bwd_allreduce: 10.48 | step: 61.86 20%|██ | 123/600 [22:39<1:24:25, 10.62s/it] {'loss': 0.522, 'learning_rate': 9.218165262080023e-05, 'epoch': 1.23} 20%|██ | 123/600 [22:39<1:24:25, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.0482, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0134, device='cuda:0', grad_fn=) [2024-06-18 22:37:50,477] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2439.81 | bwd_microstep: 1381.36 | bwd_inner_microstep: 1376.13 | bwd_allreduce_microstep: 5.06 | 
step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0986, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0587, device='cuda:0', grad_fn=) [2024-06-18 22:37:55,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:37:55,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2956.06 | bwd_microstep: 1849.35 | bwd_inner_microstep: 1843.71 | bwd_allreduce_microstep: 5.52 | step_microstep: 62.19 [2024-06-18 22:37:55,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5395.83 | bwd: 3230.70 | bwd_inner: 3219.88 | bwd_allreduce: 10.59 | step: 62.28 21%|██ | 124/600 [22:48<1:20:05, 10.10s/it] {'loss': 1.036, 'learning_rate': 9.203612603454604e-05, 'epoch': 1.24} 21%|██ | 124/600 [22:48<1:20:05, 10.10s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0232, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1024, device='cuda:0', grad_fn=) [2024-06-18 22:38:00,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3485.81 | bwd_microstep: 1738.89 | bwd_inner_microstep: 1733.83 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1747, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1157, device='cuda:0', grad_fn=) [2024-06-18 22:38:06,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:38:06,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3594.55 | bwd_microstep: 1973.23 | bwd_inner_microstep: 1967.63 | bwd_allreduce_microstep: 5.49 | step_microstep: 61.71 [2024-06-18 22:38:06,500] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7080.34 | bwd: 3712.12 | bwd_inner: 3701.47 | bwd_allreduce: 10.45 | step: 61.79 21%|██ | 125/600 [22:59<1:22:12, 10.38s/it] {'loss': 0.609, 'learning_rate': 9.18893746194426e-05, 'epoch': 1.25} 21%|██ | 125/600 [22:59<1:22:12, 10.38s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.9531, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9163, device='cuda:0', grad_fn=) [2024-06-18 22:38:11,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2967.97 | bwd_microstep: 1856.68 | bwd_inner_microstep: 1851.47 | bwd_allreduce_microstep: 5.09 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2321, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1674, device='cuda:0', grad_fn=) [2024-06-18 22:38:17,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:38:17,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.44 | bwd_microstep: 1910.71 | bwd_inner_microstep: 1905.15 | bwd_allreduce_microstep: 5.45 | step_microstep: 62.43 [2024-06-18 22:38:17,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6531.40 | bwd: 3767.40 | bwd_inner: 3756.64 | bwd_allreduce: 10.55 | step: 62.52 21%|██ | 126/600 [23:10<1:22:28, 10.44s/it] {'loss': 1.0418, 'learning_rate': 9.174140265146356e-05, 'epoch': 1.26} 21%|██ | 126/600 [23:10<1:22:28, 10.44s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0095, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0901, device='cuda:0', grad_fn=) [2024-06-18 22:38:20,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2205.59 | bwd_microstep: 1273.33 | bwd_inner_microstep: 1268.19 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7868, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7781, device='cuda:0', grad_fn=) [2024-06-18 22:38:25,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:38:25,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2759.37 | bwd_microstep: 1803.46 | bwd_inner_microstep: 1797.91 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.98 [2024-06-18 22:38:25,346] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4964.92 | bwd: 3076.78 | bwd_inner: 3066.14 | bwd_allreduce: 10.45 | step: 62.07 21%|██ | 127/600 [23:18<1:17:11, 9.79s/it] {'loss': 0.4341, 'learning_rate': 9.159221444214645e-05, 'epoch': 1.27} 21%|██ | 127/600 [23:18<1:17:11, 9.79s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8578, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8420, device='cuda:0', grad_fn=) [2024-06-18 22:38:30,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.03 | bwd_microstep: 1892.21 | bwd_inner_microstep: 1887.05 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9279, 
device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9051, device='cuda:0', grad_fn=) [2024-06-18 22:38:36,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:38:36,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3571.51 | bwd_microstep: 1924.07 | bwd_inner_microstep: 1918.54 | bwd_allreduce_microstep: 5.42 | step_microstep: 62.26 [2024-06-18 22:38:36,554] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.50 | bwd: 3816.28 | bwd_inner: 3805.60 | bwd_allreduce: 10.48 | step: 62.40 21%|██▏ | 128/600 [23:29<1:20:21, 10.22s/it] {'loss': 0.8735, 'learning_rate': 9.144181433846707e-05, 'epoch': 1.28} 21%|██▏ | 128/600 [23:29<1:20:21, 10.22s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7531, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7481, device='cuda:0', grad_fn=) [2024-06-18 22:38:42,146] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3569.22 | bwd_microstep: 1921.35 | bwd_inner_microstep: 1916.26 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0170, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0972, device='cuda:0', 
grad_fn=) [2024-06-18 22:38:47,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:38:47,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3508.61 | bwd_microstep: 1810.07 | bwd_inner_microstep: 1804.52 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.87 [2024-06-18 22:38:47,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7077.81 | bwd: 3731.42 | bwd_inner: 3720.78 | bwd_allreduce: 10.44 | step: 61.96 22%|██▏ | 129/600 [23:40<1:22:11, 10.47s/it] {'loss': 0.4226, 'learning_rate': 9.129020672271283e-05, 'epoch': 1.29} 22%|██▏ | 129/600 [23:40<1:22:11, 10.47s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9120, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8793, device='cuda:0', grad_fn=) [2024-06-18 22:38:53,249] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.70 | bwd_microstep: 1942.23 | bwd_inner_microstep: 1937.12 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0970, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0573, device='cuda:0', grad_fn=) [2024-06-18 22:38:58,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 
1.84 [2024-06-18 22:38:58,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3587.13 | bwd_microstep: 1955.38 | bwd_inner_microstep: 1949.83 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.97 [2024-06-18 22:38:58,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7163.80 | bwd: 3897.60 | bwd_inner: 3886.95 | bwd_allreduce: 10.47 | step: 62.05 22%|██▏ | 130/600 [23:52<1:24:03, 10.73s/it] {'loss': 0.9683, 'learning_rate': 9.113739601235507e-05, 'epoch': 1.3} 22%|██▏ | 130/600 [23:52<1:24:03, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0072, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0884, device='cuda:0', grad_fn=) [2024-06-18 22:39:04,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3491.48 | bwd_microstep: 1745.07 | bwd_inner_microstep: 1739.80 | bwd_allreduce_microstep: 5.13 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1247, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0822, device='cuda:0', grad_fn=) [2024-06-18 22:39:09,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.97 [2024-06-18 22:39:09,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.05 
| bwd_microstep: 1957.73 | bwd_inner_microstep: 1952.11 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.38 [2024-06-18 22:39:09,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7077.51 | bwd: 3702.80 | bwd_inner: 3691.95 | bwd_allreduce: 10.64 | step: 62.46 22%|██▏ | 131/600 [24:03<1:24:36, 10.82s/it] {'loss': 0.5853, 'learning_rate': 9.09833866599203e-05, 'epoch': 1.31} 22%|██▏ | 131/600 [24:03<1:24:36, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4879, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.5088, device='cuda:0', grad_fn=) [2024-06-18 22:39:15,578] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.41 | bwd_microstep: 1916.92 | bwd_inner_microstep: 1911.85 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0262, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1050, device='cuda:0', grad_fn=) [2024-06-18 22:39:20,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:39:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3487.15 | bwd_microstep: 1737.47 | bwd_inner_microstep: 1731.79 | bwd_allreduce_microstep: 5.56 | step_microstep: 
61.94 [2024-06-18 22:39:20,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7051.54 | bwd: 3654.39 | bwd_inner: 3643.70 | bwd_allreduce: 10.48 | step: 62.03 22%|██▏ | 132/600 [24:14<1:24:44, 10.86s/it] {'loss': 0.3069, 'learning_rate': 9.082818315286055e-05, 'epoch': 1.32} 22%|██▏ | 132/600 [24:14<1:24:44, 10.86s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.1994, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1494, device='cuda:0', grad_fn=) [2024-06-18 22:39:25,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2765.63 | bwd_microstep: 1807.06 | bwd_inner_microstep: 1802.01 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0289, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1075, device='cuda:0', grad_fn=) [2024-06-18 22:39:31,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:39:31,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3488.26 | bwd_microstep: 1745.38 | bwd_inner_microstep: 1739.66 | bwd_allreduce_microstep: 5.61 | step_microstep: 62.76 [2024-06-18 22:39:31,016] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6253.88 | bwd: 
3552.44 | bwd_inner: 3541.69 | bwd_allreduce: 10.56 | step: 62.85 22%|██▏ | 133/600 [24:24<1:22:41, 10.62s/it] {'loss': 0.6285, 'learning_rate': 9.067179001342252e-05, 'epoch': 1.33} 22%|██▏ | 133/600 [24:24<1:22:41, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0276, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1067, device='cuda:0', grad_fn=) [2024-06-18 22:39:36,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3505.99 | bwd_microstep: 1805.89 | bwd_inner_microstep: 1800.79 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1197, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(1.0774, device='cuda:0', grad_fn=) [2024-06-18 22:39:42,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:39:42,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.59 | bwd_microstep: 1912.50 | bwd_inner_microstep: 1906.92 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.70 [2024-06-18 22:39:42,060] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7071.57 | bwd: 3718.38 | bwd_inner: 3707.76 | bwd_allreduce: 10.37 | step: 61.78 22%|██▏ | 134/600 [24:35<1:23:29, 
10.75s/it] {'loss': 0.592, 'learning_rate': 9.051421179851588e-05, 'epoch': 1.34} 22%|██▏ | 134/600 [24:35<1:23:29, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.2620, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.1940, device='cuda:0', grad_fn=) [2024-06-18 22:39:47,061] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2986.42 | bwd_microstep: 1906.26 | bwd_inner_microstep: 1901.24 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.1479, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0916, device='cuda:0', grad_fn=) [2024-06-18 22:39:51,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:39:51,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2783.10 | bwd_microstep: 1839.73 | bwd_inner_microstep: 1833.96 | bwd_allreduce_microstep: 5.66 | step_microstep: 62.82 [2024-06-18 22:39:51,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5769.51 | bwd: 3745.99 | bwd_inner: 3735.21 | bwd_allreduce: 10.58 | step: 62.89 22%|██▎ | 135/600 [24:45<1:21:05, 10.46s/it] {'loss': 1.1428, 'learning_rate': 9.035545309958046e-05, 'epoch': 1.35} 22%|██▎ | 135/600 
[24:45<1:21:05, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0350, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1130, device='cuda:0', grad_fn=) [2024-06-18 22:39:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3503.02 | bwd_microstep: 1804.40 | bwd_inner_microstep: 1799.39 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7008, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.7011, device='cuda:0', grad_fn=) [2024-06-18 22:40:02,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:40:02,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.80 | bwd_microstep: 1899.42 | bwd_inner_microstep: 1893.81 | bwd_allreduce_microstep: 5.52 | step_microstep: 62.45 [2024-06-18 22:40:02,873] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7062.82 | bwd: 3703.82 | bwd_inner: 3693.20 | bwd_allreduce: 10.43 | step: 62.53 23%|██▎ | 136/600 [24:56<1:22:12, 10.63s/it] {'loss': 0.407, 'learning_rate': 9.01955185424525e-05, 'epoch': 1.36} 23%|██▎ | 136/600 [24:56<1:22:12, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7301, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7267, device='cuda:0', grad_fn=) [2024-06-18 22:40:08,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.92 | bwd_microstep: 1912.53 | bwd_inner_microstep: 1907.25 | bwd_allreduce_microstep: 5.17 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0231, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1023, device='cuda:0', grad_fn=) [2024-06-18 22:40:13,827] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:40:13,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3483.02 | bwd_microstep: 1740.44 | bwd_inner_microstep: 1734.88 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.87 [2024-06-18 22:40:13,828] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7044.92 | bwd: 3652.96 | bwd_inner: 3642.15 | bwd_allreduce: 10.62 | step: 62.01 23%|██▎ | 137/600 [25:06<1:22:46, 10.73s/it] {'loss': 0.4145, 'learning_rate': 9.003441278722981e-05, 'epoch': 1.37} 23%|██▎ | 137/600 [25:06<1:22:46, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.1700, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1230, device='cuda:0', grad_fn=) [2024-06-18 22:40:18,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2803.85 | bwd_microstep: 1898.94 | bwd_inner_microstep: 1893.87 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9867, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9465, device='cuda:0', grad_fn=) [2024-06-18 22:40:24,408] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:40:24,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3602.52 | bwd_microstep: 1991.59 | bwd_inner_microstep: 1985.95 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.67 [2024-06-18 22:40:24,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6406.33 | bwd: 3890.52 | bwd_inner: 3879.92 | bwd_allreduce: 10.38 | step: 62.75 23%|██▎ | 138/600 [25:17<1:22:15, 10.68s/it] {'loss': 1.0347, 'learning_rate': 8.987214052813604e-05, 'epoch': 1.38} 23%|██▎ | 138/600 [25:17<1:22:15, 10.68s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1179, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.0643, device='cuda:0', grad_fn=) [2024-06-18 22:40:29,981] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.44 | bwd_microstep: 1906.52 | bwd_inner_microstep: 1901.31 | bwd_allreduce_microstep: 5.09 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4389, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.3646, device='cuda:0', grad_fn=) [2024-06-18 22:40:34,856] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:40:34,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2808.73 | bwd_microstep: 1899.73 | bwd_inner_microstep: 1894.17 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.73 [2024-06-18 22:40:34,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6370.15 | bwd: 3806.24 | bwd_inner: 3795.50 | bwd_allreduce: 10.54 | step: 61.82 23%|██▎ | 139/600 [25:28<1:21:32, 10.61s/it] {'loss': 1.2144, 'learning_rate': 8.970870649338387e-05, 'epoch': 1.39} 23%|██▎ | 139/600 [25:28<1:21:32, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7753, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7677, device='cuda:0', grad_fn=) [2024-06-18 22:40:40,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.26 | bwd_microstep: 1918.89 | bwd_inner_microstep: 1913.75 | bwd_allreduce_microstep: 5.02 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4808, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.4031, device='cuda:0', grad_fn=) [2024-06-18 22:40:45,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:40:45,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3206.76 | bwd_microstep: 1788.29 | bwd_inner_microstep: 1782.52 | bwd_allreduce_microstep: 5.65 | step_microstep: 62.46 [2024-06-18 22:40:45,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6767.99 | bwd: 3707.17 | bwd_inner: 3696.29 | bwd_allreduce: 10.67 | step: 62.55 23%|██▎ | 140/600 [25:38<1:21:39, 10.65s/it] {'loss': 1.0854, 'learning_rate': 8.954411544503729e-05, 'epoch': 1.4} 23%|██▎ | 140/600 [25:38<1:21:39, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0914, device='cuda:0', grad_fn=) tensor(0.8148, 
device='cuda:0', grad_fn=) tensor(0.1638, device='cuda:0', grad_fn=) [2024-06-18 22:40:49,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2662.48 | bwd_microstep: 1608.16 | bwd_inner_microstep: 1603.20 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0483, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(1.0132, device='cuda:0', grad_fn=) [2024-06-18 22:40:55,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 22:40:55,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.82 | bwd_microstep: 1915.87 | bwd_inner_microstep: 1910.22 | bwd_allreduce_microstep: 5.55 | step_microstep: 62.09 [2024-06-18 22:40:55,595] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6225.28 | bwd: 3524.02 | bwd_inner: 3513.42 | bwd_allreduce: 10.41 | step: 62.17 24%|██▎ | 141/600 [25:48<1:19:58, 10.46s/it] {'loss': 0.5885, 'learning_rate': 8.937837217887273e-05, 'epoch': 1.41} 24%|██▎ | 141/600 [25:48<1:19:58, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2535, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1981, device='cuda:0', grad_fn=) [2024-06-18 22:41:01,249] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.11 | bwd_microstep: 1959.85 | bwd_inner_microstep: 1954.83 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8923, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8616, device='cuda:0', grad_fn=) [2024-06-18 22:41:06,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.05 [2024-06-18 22:41:06,903] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.43 | bwd_microstep: 1922.59 | bwd_inner_microstep: 1916.88 | bwd_allreduce_microstep: 5.60 | step_microstep: 64.17 [2024-06-18 22:41:06,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7155.51 | bwd: 3882.44 | bwd_inner: 3871.71 | bwd_allreduce: 10.56 | step: 64.25 24%|██▎ | 142/600 [26:00<1:21:45, 10.71s/it] {'loss': 1.0299, 'learning_rate': 8.921148152423946e-05, 'epoch': 1.42} 24%|██▎ | 142/600 [26:00<1:21:45, 10.71s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.4050, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3345, device='cuda:0', grad_fn=) [2024-06-18 22:41:11,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2791.18 | bwd_microstep: 1872.14 | 
bwd_inner_microstep: 1867.06 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0371, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0029, device='cuda:0', grad_fn=) [2024-06-18 22:41:17,336] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:41:17,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3572.64 | bwd_microstep: 1926.12 | bwd_inner_microstep: 1920.48 | bwd_allreduce_microstep: 5.47 | step_microstep: 61.87 [2024-06-18 22:41:17,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6363.78 | bwd: 3798.26 | bwd_inner: 3787.64 | bwd_allreduce: 10.38 | step: 61.96 24%|██▍ | 143/600 [26:10<1:20:56, 10.63s/it] {'loss': 1.1687, 'learning_rate': 8.904344834391882e-05, 'epoch': 1.43} 24%|██▍ | 143/600 [26:10<1:20:56, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2715, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.2139, device='cuda:0', grad_fn=) [2024-06-18 22:41:22,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.06 | bwd_microstep: 1881.02 | bwd_inner_microstep: 1875.90 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.0427, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9969, device='cuda:0', grad_fn=) [2024-06-18 22:41:27,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:41:27,654] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2779.00 | bwd_microstep: 1841.49 | bwd_inner_microstep: 1835.93 | bwd_allreduce_microstep: 5.45 | step_microstep: 62.01 [2024-06-18 22:41:27,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6332.05 | bwd: 3722.51 | bwd_inner: 3711.88 | bwd_allreduce: 10.41 | step: 62.09 24%|██▍ | 144/600 [26:20<1:20:03, 10.53s/it] {'loss': 1.1054, 'learning_rate': 8.887427753398248e-05, 'epoch': 1.44} 24%|██▍ | 144/600 [26:20<1:20:03, 10.53s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9892, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9603, device='cuda:0', grad_fn=) [2024-06-18 22:41:32,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2696.62 | bwd_microstep: 1661.95 | bwd_inner_microstep: 1656.94 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8777, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8599, device='cuda:0', grad_fn=) [2024-06-18 22:41:37,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:41:37,731] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.95 | bwd_microstep: 1895.49 | bwd_inner_microstep: 1889.78 | bwd_allreduce_microstep: 5.60 | step_microstep: 62.23 [2024-06-18 22:41:37,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6261.56 | bwd: 3557.44 | bwd_inner: 3546.73 | bwd_allreduce: 10.51 | step: 62.31 24%|██▍ | 145/600 [26:30<1:18:50, 10.40s/it] {'loss': 0.9101, 'learning_rate': 8.870397402364984e-05, 'epoch': 1.45} 24%|██▍ | 145/600 [26:30<1:18:50, 10.40s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1473, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.2144, device='cuda:0', grad_fn=) [2024-06-18 22:41:43,024] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.20 | bwd_microstep: 1725.58 | bwd_inner_microstep: 1720.50 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.9894, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9604, device='cuda:0', grad_fn=) [2024-06-18 22:41:46,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:41:46,703] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2245.56 | bwd_microstep: 1288.85 | bwd_inner_microstep: 1283.32 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.92 [2024-06-18 22:41:46,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5721.74 | bwd: 3014.42 | bwd_inner: 3003.83 | bwd_allreduce: 10.40 | step: 62.00 24%|██▍ | 146/600 [26:39<1:15:26, 9.97s/it] {'loss': 0.5874, 'learning_rate': 8.853254277514446e-05, 'epoch': 1.46} 24%|██▍ | 146/600 [26:39<1:15:26, 9.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0210, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1004, device='cuda:0', grad_fn=) [2024-06-18 22:41:51,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2687.66 | bwd_microstep: 1652.97 | bwd_inner_microstep: 1647.92 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9126, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8909, device='cuda:0', grad_fn=) [2024-06-18 22:41:56,084] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:41:56,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2954.87 | bwd_microstep: 1833.83 | bwd_inner_microstep: 1828.13 | bwd_allreduce_microstep: 5.59 | step_microstep: 62.02 [2024-06-18 22:41:56,085] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5642.51 | bwd: 3486.79 | bwd_inner: 3476.07 | bwd_allreduce: 10.54 | step: 62.11 24%|██▍ | 147/600 [26:49<1:13:56, 9.79s/it] {'loss': 0.4957, 'learning_rate': 8.835998878354931e-05, 'epoch': 1.47} 24%|██▍ | 147/600 [26:49<1:13:56, 9.79s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0022, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.9716, device='cuda:0', grad_fn=) [2024-06-18 22:42:01,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.87 | bwd_microstep: 1892.50 | bwd_inner_microstep: 1887.28 | bwd_allreduce_microstep: 5.11 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0169, 
device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0967, device='cuda:0', grad_fn=) [2024-06-18 22:42:07,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:42:07,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3504.95 | bwd_microstep: 1808.51 | bwd_inner_microstep: 1802.96 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.98 [2024-06-18 22:42:07,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7060.78 | bwd: 3701.00 | bwd_inner: 3690.25 | bwd_allreduce: 10.55 | step: 62.08 25%|██▍ | 148/600 [27:00<1:16:32, 10.16s/it] {'loss': 0.5342, 'learning_rate': 8.818631707666135e-05, 'epoch': 1.48} 25%|██▍ | 148/600 [27:00<1:16:32, 10.16s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0571, device='cuda:0', grad_fn=) tensor(0.7006, device='cuda:0', grad_fn=) tensor(1.0214, device='cuda:0', grad_fn=) [2024-06-18 22:42:11,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2772.90 | bwd_microstep: 1831.56 | bwd_inner_microstep: 1826.52 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1060, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0538, device='cuda:0', grad_fn=) 
[2024-06-18 22:42:16,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:42:16,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2780.65 | bwd_microstep: 1842.28 | bwd_inner_microstep: 1836.75 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.99 [2024-06-18 22:42:16,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5553.53 | bwd: 3673.83 | bwd_inner: 3663.31 | bwd_allreduce: 10.32 | step: 62.07 25%|██▍ | 149/600 [27:09<1:14:52, 9.96s/it] {'loss': 1.0376, 'learning_rate': 8.801153271484502e-05, 'epoch': 1.49} 25%|██▍ | 149/600 [27:09<1:14:52, 9.96s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9582, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.9321, device='cuda:0', grad_fn=) [2024-06-18 22:42:22,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.66 | bwd_microstep: 1892.55 | bwd_inner_microstep: 1887.28 | bwd_allreduce_microstep: 5.14 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8916, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(0.8606, device='cuda:0', grad_fn=) [2024-06-18 22:42:27,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 
[2024-06-18 22:42:27,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3106.58 | bwd_microstep: 1861.48 | bwd_inner_microstep: 1855.83 | bwd_allreduce_microstep: 5.53 | step_microstep: 62.24 [2024-06-18 22:42:27,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6662.21 | bwd: 3754.01 | bwd_inner: 3743.14 | bwd_allreduce: 10.68 | step: 62.35 25%|██▌ | 150/600 [27:20<1:16:20, 10.18s/it] {'loss': 0.8963, 'learning_rate': 8.783564079088477e-05, 'epoch': 1.5} 25%|██▌ | 150/600 [27:20<1:16:20, 10.18s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0178, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9860, device='cuda:0', grad_fn=) [2024-06-18 22:42:32,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.29 | bwd_microstep: 1959.85 | bwd_inner_microstep: 1954.88 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5811, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5929, device='cuda:0', grad_fn=) [2024-06-18 22:42:38,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:42:38,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.04 | 
bwd_microstep: 1920.07 | bwd_inner_microstep: 1914.49 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.19 [2024-06-18 22:42:38,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7154.29 | bwd: 3879.92 | bwd_inner: 3869.40 | bwd_allreduce: 10.32 | step: 62.28 25%|██▌ | 151/600 [27:31<1:18:41, 10.52s/it] {'loss': 0.7895, 'learning_rate': 8.765864642983665e-05, 'epoch': 1.51} 25%|██▌ | 151/600 [27:31<1:18:41, 10.52s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0330, device='cuda:0', grad_fn=) tensor(0.7072, device='cuda:0', grad_fn=) tensor(1.0005, device='cuda:0', grad_fn=) [2024-06-18 22:42:44,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.51 | bwd_microstep: 1963.05 | bwd_inner_microstep: 1957.87 | bwd_allreduce_microstep: 5.07 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8778, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8488, device='cuda:0', grad_fn=) [2024-06-18 22:42:49,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:42:49,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.39 | bwd_microstep: 1932.75 | bwd_inner_microstep: 1927.21 | bwd_allreduce_microstep: 5.43 | step_microstep: 
61.75 [2024-06-18 22:42:49,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7164.88 | bwd: 3895.79 | bwd_inner: 3885.09 | bwd_allreduce: 10.51 | step: 61.89 25%|██▌ | 152/600 [27:43<1:20:20, 10.76s/it] {'loss': 0.9246, 'learning_rate': 8.748055478887904e-05, 'epoch': 1.52} 25%|██▌ | 152/600 [27:43<1:20:20, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4022, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3319, device='cuda:0', grad_fn=) [2024-06-18 22:42:55,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.92 | bwd_microstep: 1962.21 | bwd_inner_microstep: 1957.21 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9055, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8738, device='cuda:0', grad_fn=) [2024-06-18 22:43:01,232] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:43:01,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.90 | bwd_microstep: 1925.11 | bwd_inner_microstep: 1919.47 | bwd_allreduce_microstep: 5.53 | step_microstep: 61.87 [2024-06-18 22:43:01,233] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7158.80 | bwd: 
3887.31 | bwd_inner: 3876.70 | bwd_allreduce: 10.42 | step: 61.96 26%|██▌ | 153/600 [27:54<1:21:24, 10.93s/it] {'loss': 1.1028, 'learning_rate': 8.73013710571623e-05, 'epoch': 1.53} 26%|██▌ | 153/600 [27:54<1:21:24, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7894, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7804, device='cuda:0', grad_fn=) [2024-06-18 22:43:06,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.10 | bwd_microstep: 1891.38 | bwd_inner_microstep: 1886.37 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0467, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0116, device='cuda:0', grad_fn=) [2024-06-18 22:43:12,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:43:12,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.96 | bwd_microstep: 1922.85 | bwd_inner_microstep: 1917.36 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.83 [2024-06-18 22:43:12,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7131.05 | bwd: 3814.21 | bwd_inner: 3803.72 | bwd_allreduce: 10.32 | step: 61.90 26%|██▌ | 154/600 [28:05<1:21:50, 
11.01s/it] {'loss': 0.896, 'learning_rate': 8.712110045565768e-05, 'epoch': 1.54} 26%|██▌ | 154/600 [28:05<1:21:50, 11.01s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0214, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.9889, device='cuda:0', grad_fn=) [2024-06-18 22:43:18,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.63 | bwd_microstep: 1893.71 | bwd_inner_microstep: 1888.45 | bwd_allreduce_microstep: 5.14 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.9845, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.9556, device='cuda:0', grad_fn=) [2024-06-18 22:43:22,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 22:43:22,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2760.04 | bwd_microstep: 1804.30 | bwd_inner_microstep: 1798.72 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.13 [2024-06-18 22:43:22,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6325.65 | bwd: 3698.00 | bwd_inner: 3687.18 | bwd_allreduce: 10.62 | step: 62.24 26%|██▌ | 155/600 [28:15<1:20:02, 10.79s/it] {'loss': 0.9723, 'learning_rate': 8.693974823700506e-05, 'epoch': 1.55} 26%|██▌ | 155/600 
[28:15<1:20:02, 10.79s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0986, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0472, device='cuda:0', grad_fn=) [2024-06-18 22:43:28,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3591.76 | bwd_microstep: 1974.92 | bwd_inner_microstep: 1969.82 | bwd_allreduce_microstep: 4.97 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4896, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.4106, device='cuda:0', grad_fn=) [2024-06-18 22:43:34,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:43:34,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3584.59 | bwd_microstep: 1950.31 | bwd_inner_microstep: 1944.76 | bwd_allreduce_microstep: 5.44 | step_microstep: 62.06 [2024-06-18 22:43:34,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7176.34 | bwd: 3925.23 | bwd_inner: 3914.58 | bwd_allreduce: 10.43 | step: 62.14 26%|██▌ | 156/600 [28:27<1:21:09, 10.97s/it] {'loss': 1.2289, 'learning_rate': 8.675731968536002e-05, 'epoch': 1.56} 26%|██▌ | 156/600 [28:27<1:21:09, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0643, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0274, device='cuda:0', grad_fn=) [2024-06-18 22:43:39,686] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.30 | bwd_microstep: 1919.00 | bwd_inner_microstep: 1913.91 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2581, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2023, device='cuda:0', grad_fn=) [2024-06-18 22:43:45,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:43:45,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.82 | bwd_microstep: 1959.71 | bwd_inner_microstep: 1954.03 | bwd_allreduce_microstep: 5.55 | step_microstep: 62.06 [2024-06-18 22:43:45,400] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7155.10 | bwd: 3878.70 | bwd_inner: 3867.99 | bwd_allreduce: 10.51 | step: 62.15 26%|██▌ | 157/600 [28:38<1:21:42, 11.07s/it] {'loss': 1.1148, 'learning_rate': 8.657382011623981e-05, 'epoch': 1.57} 26%|██▌ | 157/600 [28:38<1:21:42, 11.07s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3156, device='cuda:0', grad_fn=) tensor(0.6924, device='cuda:0', grad_fn=) tensor(1.2533, device='cuda:0', grad_fn=) [2024-06-18 22:43:51,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3594.08 | bwd_microstep: 1968.16 | bwd_inner_microstep: 1963.02 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0350, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1126, device='cuda:0', grad_fn=) [2024-06-18 22:43:56,448] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:43:56,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3484.57 | bwd_microstep: 1741.37 | bwd_inner_microstep: 1735.41 | bwd_allreduce_microstep: 5.84 | step_microstep: 62.87 [2024-06-18 22:43:56,449] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7078.64 | bwd: 3709.52 | bwd_inner: 3698.48 | bwd_allreduce: 10.81 | step: 62.95 26%|██▋ | 158/600 [28:49<1:21:29, 11.06s/it] {'loss': 0.6829, 'learning_rate': 8.638925487636848e-05, 'epoch': 1.58} 26%|██▋ | 158/600 [28:49<1:21:29, 11.06s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor 
b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9962, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.9661, device='cuda:0', grad_fn=) [2024-06-18 22:44:02,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3567.87 | bwd_microstep: 1925.54 | bwd_inner_microstep: 1920.46 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0476, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0128, device='cuda:0', grad_fn=) [2024-06-18 22:44:07,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:44:07,193] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3203.43 | bwd_microstep: 1785.03 | bwd_inner_microstep: 1779.51 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.85 [2024-06-18 22:44:07,194] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6771.26 | bwd: 3710.56 | bwd_inner: 3700.03 | bwd_allreduce: 10.33 | step: 61.94 26%|██▋ | 159/600 [29:00<1:20:36, 10.97s/it] {'loss': 0.9895, 'learning_rate': 8.620362934352109e-05, 'epoch': 1.59} 26%|██▋ | 159/600 [29:00<1:20:36, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0661, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1405, device='cuda:0', grad_fn=) [2024-06-18 22:44:12,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.34 | bwd_microstep: 1726.46 | bwd_inner_microstep: 1721.21 | bwd_allreduce_microstep: 5.11 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0348, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9901, device='cuda:0', grad_fn=) [2024-06-18 22:44:18,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:44:18,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.66 | bwd_microstep: 1941.19 | bwd_inner_microstep: 1935.65 | bwd_allreduce_microstep: 5.42 | step_microstep: 62.42 [2024-06-18 22:44:18,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.95 | bwd: 3667.63 | bwd_inner: 3656.89 | bwd_allreduce: 10.53 | step: 62.51 27%|██▋ | 160/600 [29:11<1:20:27, 10.97s/it] {'loss': 0.5653, 'learning_rate': 8.6016948926367e-05, 'epoch': 1.6} 27%|██▋ | 160/600 [29:11<1:20:27, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8145, device='cuda:0', grad_fn=) tensor(0.7031, 
device='cuda:0', grad_fn=) tensor(0.8034, device='cuda:0', grad_fn=) [2024-06-18 22:44:23,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.17 | bwd_microstep: 1912.33 | bwd_inner_microstep: 1907.30 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8773, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8483, device='cuda:0', grad_fn=) [2024-06-18 22:44:29,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:44:29,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.24 | bwd_microstep: 1974.05 | bwd_inner_microstep: 1968.39 | bwd_allreduce_microstep: 5.48 | step_microstep: 62.01 [2024-06-18 22:44:29,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7154.37 | bwd: 3886.37 | bwd_inner: 3875.77 | bwd_allreduce: 10.36 | step: 62.09 27%|██▋ | 161/600 [29:22<1:21:01, 11.07s/it] {'loss': 0.8259, 'learning_rate': 8.582921906431237e-05, 'epoch': 1.61} 27%|██▋ | 161/600 [29:22<1:21:01, 11.07s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5234, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5410, device='cuda:0', grad_fn=) [2024-06-18 22:44:34,202] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2773.56 | bwd_microstep: 1836.08 | bwd_inner_microstep: 1830.99 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0617, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0143, device='cuda:0', grad_fn=) [2024-06-18 22:44:39,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:44:39,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.89 | bwd_microstep: 1976.20 | bwd_inner_microstep: 1970.65 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.93 [2024-06-18 22:44:39,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6367.44 | bwd: 3812.27 | bwd_inner: 3801.70 | bwd_allreduce: 10.36 | step: 62.01 27%|██▋ | 162/600 [29:33<1:19:28, 10.89s/it] {'loss': 0.7777, 'learning_rate': 8.564044522734147e-05, 'epoch': 1.62} 27%|██▋ | 162/600 [29:33<1:19:28, 10.89s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7631, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7567, device='cuda:0', grad_fn=) [2024-06-18 22:44:45,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.30 | bwd_microstep: 1959.85 | 
bwd_inner_microstep: 1954.81 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2835, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(1.2251, device='cuda:0', grad_fn=) [2024-06-18 22:44:51,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:44:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.57 | bwd_microstep: 1928.10 | bwd_inner_microstep: 1922.53 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.87 [2024-06-18 22:44:51,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7160.82 | bwd: 3887.94 | bwd_inner: 3877.36 | bwd_allreduce: 10.39 | step: 61.96 27%|██▋ | 163/600 [29:44<1:20:14, 11.02s/it] {'loss': 0.9909, 'learning_rate': 8.545063291585752e-05, 'epoch': 1.63} 27%|██▋ | 163/600 [29:44<1:20:14, 11.02s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1720, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.2359, device='cuda:0', grad_fn=) [2024-06-18 22:44:56,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3480.96 | bwd_microstep: 1740.85 | bwd_inner_microstep: 1735.45 | bwd_allreduce_microstep: 5.26 | step_microstep: 0.10 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6878, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6893, device='cuda:0', grad_fn=) [2024-06-18 22:45:02,179] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 22:45:02,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.49 | bwd_microstep: 1889.42 | bwd_inner_microstep: 1883.81 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.64 [2024-06-18 22:45:02,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7038.42 | bwd: 3630.26 | bwd_inner: 3619.28 | bwd_allreduce: 10.77 | step: 62.74 27%|██▋ | 164/600 [29:55<1:19:50, 10.99s/it] {'loss': 0.4626, 'learning_rate': 8.52597876605223e-05, 'epoch': 1.64} 27%|██▋ | 164/600 [29:55<1:19:50, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2054, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1548, device='cuda:0', grad_fn=) [2024-06-18 22:45:07,861] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.45 | bwd_microstep: 1979.34 | bwd_inner_microstep: 1974.21 | bwd_allreduce_microstep: 5.02 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9867, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9472, device='cuda:0', grad_fn=) [2024-06-18 22:45:12,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 22:45:12,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2770.28 | bwd_microstep: 1815.25 | bwd_inner_microstep: 1809.54 | bwd_allreduce_microstep: 5.55 | step_microstep: 62.14 [2024-06-18 22:45:12,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6363.71 | bwd: 3794.59 | bwd_inner: 3783.81 | bwd_allreduce: 10.57 | step: 62.23 28%|██▊ | 165/600 [30:05<1:18:27, 10.82s/it] {'loss': 1.051, 'learning_rate': 8.506791502209496e-05, 'epoch': 1.65} 28%|██▊ | 165/600 [30:05<1:18:27, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4360, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3624, device='cuda:0', grad_fn=) [2024-06-18 22:45:18,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.04 | bwd_microstep: 1893.56 | bwd_inner_microstep: 1888.35 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7656, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7594, device='cuda:0', grad_fn=) [2024-06-18 22:45:23,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:45:23,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.33 | bwd_microstep: 1892.24 | bwd_inner_microstep: 1886.65 | bwd_allreduce_microstep: 5.43 | step_microstep: 63.34 [2024-06-18 22:45:23,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7119.33 | bwd: 3785.80 | bwd_inner: 3775.08 | bwd_allreduce: 10.51 | step: 63.49 28%|██▊ | 166/600 [30:16<1:19:01, 10.93s/it] {'loss': 1.0609, 'learning_rate': 8.487502059127015e-05, 'epoch': 1.66} 28%|██▊ | 166/600 [30:16<1:19:01, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1862, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.2486, device='cuda:0', grad_fn=) [2024-06-18 22:45:28,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3450.92 | bwd_microstep: 1676.54 | bwd_inner_microstep: 1671.19 | bwd_allreduce_microstep: 5.19 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1144, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0725, device='cuda:0', grad_fn=) [2024-06-18 22:45:34,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:45:34,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.22 | bwd_microstep: 1889.39 | bwd_inner_microstep: 1883.74 | bwd_allreduce_microstep: 5.52 | step_microstep: 62.70 [2024-06-18 22:45:34,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7015.12 | bwd: 3565.93 | bwd_inner: 3554.98 | bwd_allreduce: 10.73 | step: 62.79 28%|██▊ | 167/600 [30:27<1:18:38, 10.90s/it] {'loss': 0.6606, 'learning_rate': 8.468110998851496e-05, 'epoch': 1.67} 28%|██▊ | 167/600 [30:27<1:18:38, 10.90s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8698, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8417, device='cuda:0', grad_fn=) [2024-06-18 22:45:40,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3600.56 | bwd_microstep: 1978.87 | bwd_inner_microstep: 1973.65 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.6293, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6251, device='cuda:0', grad_fn=) [2024-06-18 22:45:46,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 22:45:46,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3598.05 | bwd_microstep: 1976.98 | bwd_inner_microstep: 1971.32 | bwd_allreduce_microstep: 5.55 | step_microstep: 62.30 [2024-06-18 22:45:46,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7198.60 | bwd: 3955.85 | bwd_inner: 3945.04 | bwd_allreduce: 10.60 | step: 62.38 28%|██▊ | 168/600 [30:39<1:19:37, 11.06s/it] {'loss': 0.7334, 'learning_rate': 8.448618886390522e-05, 'epoch': 1.68} 28%|██▊ | 168/600 [30:39<1:19:37, 11.06s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1609, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(1.1147, device='cuda:0', grad_fn=) [2024-06-18 22:45:51,602] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.50 | bwd_microstep: 1890.75 | bwd_inner_microstep: 1885.47 | bwd_allreduce_microstep: 5.14 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9925, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9520, 
device='cuda:0', grad_fn=) [2024-06-18 22:45:56,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:45:56,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2776.97 | bwd_microstep: 1842.98 | bwd_inner_microstep: 1837.37 | bwd_allreduce_microstep: 5.48 | step_microstep: 62.52 [2024-06-18 22:45:56,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6340.46 | bwd: 3733.72 | bwd_inner: 3722.88 | bwd_allreduce: 10.64 | step: 62.60 28%|██▊ | 169/600 [30:49<1:17:53, 10.84s/it] {'loss': 1.0334, 'learning_rate': 8.429026289696091e-05, 'epoch': 1.69} 28%|██▊ | 169/600 [30:49<1:17:53, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0709, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1448, device='cuda:0', grad_fn=) [2024-06-18 22:46:01,680] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3474.27 | bwd_microstep: 1725.44 | bwd_inner_microstep: 1720.36 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7152, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7022, device='cuda:0', grad_fn=) [2024-06-18 22:46:07,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 1.84 [2024-06-18 22:46:07,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.61 | bwd_microstep: 1933.72 | bwd_inner_microstep: 1928.13 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.92 [2024-06-18 22:46:07,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7042.86 | bwd: 3659.16 | bwd_inner: 3648.55 | bwd_allreduce: 10.38 | step: 62.01 28%|██▊ | 170/600 [31:00<1:17:57, 10.88s/it] {'loss': 0.4235, 'learning_rate': 8.40933377964806e-05, 'epoch': 1.7} 28%|██▊ | 170/600 [31:00<1:17:57, 10.88s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2135, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.1625, device='cuda:0', grad_fn=) [2024-06-18 22:46:12,997] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3584.86 | bwd_microstep: 1957.12 | bwd_inner_microstep: 1952.05 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8876, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8684, device='cuda:0', grad_fn=) [2024-06-18 22:46:18,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:46:18,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 3559.76 | bwd_microstep: 1898.57 | bwd_inner_microstep: 1892.88 | bwd_allreduce_microstep: 5.50 | step_microstep: 61.74 [2024-06-18 22:46:18,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7144.58 | bwd: 3855.68 | bwd_inner: 3844.99 | bwd_allreduce: 10.45 | step: 61.83 28%|██▊ | 171/600 [31:11<1:18:36, 10.99s/it] {'loss': 1.0154, 'learning_rate': 8.389541930037516e-05, 'epoch': 1.71} 28%|██▊ | 171/600 [31:11<1:18:36, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1672, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.2312, device='cuda:0', grad_fn=) [2024-06-18 22:46:24,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3505.93 | bwd_microstep: 1801.32 | bwd_inner_microstep: 1796.32 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0196, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0987, device='cuda:0', grad_fn=) [2024-06-18 22:46:29,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:46:29,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3483.79 | bwd_microstep: 1743.07 | bwd_inner_microstep: 1737.50 | bwd_allreduce_microstep: 
5.45 | step_microstep: 62.01 [2024-06-18 22:46:29,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6989.70 | bwd: 3544.38 | bwd_inner: 3533.84 | bwd_allreduce: 10.35 | step: 62.10 29%|██▊ | 172/600 [31:22<1:17:58, 10.93s/it] {'loss': 0.165, 'learning_rate': 8.369651317550054e-05, 'epoch': 1.72} 29%|██▊ | 172/600 [31:22<1:17:58, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0628, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.0268, device='cuda:0', grad_fn=) [2024-06-18 22:46:34,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.98 | bwd_microstep: 1895.05 | bwd_inner_microstep: 1890.05 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9444, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(0.9081, device='cuda:0', grad_fn=) [2024-06-18 22:46:40,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:46:40,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.81 | bwd_microstep: 1904.41 | bwd_inner_microstep: 1898.66 | bwd_allreduce_microstep: 5.63 | step_microstep: 63.06 [2024-06-18 22:46:40,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd: 7134.78 | bwd: 3799.46 | bwd_inner: 3788.74 | bwd_allreduce: 10.52 | step: 63.14 29%|██▉ | 173/600 [31:33<1:18:21, 11.01s/it] {'loss': 0.9675, 'learning_rate': 8.349662521748977e-05, 'epoch': 1.73} 29%|██▉ | 173/600 [31:33<1:18:21, 11.01s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0219, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1008, device='cuda:0', grad_fn=) [2024-06-18 22:46:45,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3424.24 | bwd_microstep: 1638.08 | bwd_inner_microstep: 1633.07 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5905, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6013, device='cuda:0', grad_fn=) [2024-06-18 22:46:51,352] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.97 [2024-06-18 22:46:51,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.54 | bwd_microstep: 1892.97 | bwd_inner_microstep: 1887.17 | bwd_allreduce_microstep: 5.64 | step_microstep: 63.38 [2024-06-18 22:46:51,353] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6981.76 | bwd: 3531.03 | bwd_inner: 3520.28 | bwd_allreduce: 10.57 | step: 63.47 29%|██▉ | 174/600 
[31:44<1:17:38, 10.94s/it] {'loss': 0.351, 'learning_rate': 8.329576125058406e-05, 'epoch': 1.74} 29%|██▉ | 174/600 [31:44<1:17:38, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6497, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6546, device='cuda:0', grad_fn=) [2024-06-18 22:46:56,909] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.94 | bwd_microstep: 1896.41 | bwd_inner_microstep: 1891.22 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1812, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.1326, device='cuda:0', grad_fn=) [2024-06-18 22:47:02,552] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:47:02,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.08 | bwd_microstep: 1915.79 | bwd_inner_microstep: 1910.24 | bwd_allreduce_microstep: 5.44 | step_microstep: 62.28 [2024-06-18 22:47:02,553] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7121.00 | bwd: 3812.19 | bwd_inner: 3801.48 | bwd_allreduce: 10.52 | step: 62.42 29%|██▉ | 175/600 [31:55<1:18:01, 11.01s/it] {'loss': 0.8936, 'learning_rate': 8.309392712746308e-05, 'epoch': 1.75} 29%|██▉ 
| 175/600 [31:55<1:18:01, 11.01s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0144, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0948, device='cuda:0', grad_fn=) [2024-06-18 22:47:07,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2695.95 | bwd_microstep: 1655.53 | bwd_inner_microstep: 1650.24 | bwd_allreduce_microstep: 5.18 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9243, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8910, device='cuda:0', grad_fn=) [2024-06-18 22:47:11,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:47:11,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2958.32 | bwd_microstep: 1838.61 | bwd_inner_microstep: 1832.97 | bwd_allreduce_microstep: 5.53 | step_microstep: 62.26 [2024-06-18 22:47:11,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5654.25 | bwd: 3494.14 | bwd_inner: 3483.22 | bwd_allreduce: 10.72 | step: 62.35 29%|██▉ | 176/600 [32:05<1:14:26, 10.53s/it] {'loss': 0.4929, 'learning_rate': 8.289112872907454e-05, 'epoch': 1.76} 29%|██▉ | 176/600 [32:05<1:14:26, 10.53s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) 
at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0868, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0481, device='cuda:0', grad_fn=) [2024-06-18 22:47:17,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3584.32 | bwd_microstep: 1952.08 | bwd_inner_microstep: 1947.03 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6296, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.6370, device='cuda:0', grad_fn=) [2024-06-18 22:47:23,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:47:23,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.95 | bwd_microstep: 1894.55 | bwd_inner_microstep: 1888.89 | bwd_allreduce_microstep: 5.52 | step_microstep: 61.98 [2024-06-18 22:47:23,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7145.25 | bwd: 3846.62 | bwd_inner: 3835.96 | bwd_allreduce: 10.47 | step: 62.07 30%|██▉ | 177/600 [32:16<1:15:47, 10.75s/it] {'loss': 0.8426, 'learning_rate': 8.268737196446264e-05, 'epoch': 1.77} 30%|██▉ | 177/600 [32:16<1:15:47, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2298, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.2879, device='cuda:0', grad_fn=) [2024-06-18 22:47:28,507] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.68 | bwd_microstep: 1724.28 | bwd_inner_microstep: 1719.27 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4128, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.3299, device='cuda:0', grad_fn=) [2024-06-18 22:47:33,258] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 22:47:33,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2778.09 | bwd_microstep: 1811.84 | bwd_inner_microstep: 1806.10 | bwd_allreduce_microstep: 5.62 | step_microstep: 62.19 [2024-06-18 22:47:33,259] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6253.75 | bwd: 3536.12 | bwd_inner: 3525.43 | bwd_allreduce: 10.45 | step: 62.26 30%|██▉ | 178/600 [32:26<1:14:06, 10.54s/it] {'loss': 0.8089, 'learning_rate': 8.248266277059607e-05, 'epoch': 1.78} 30%|██▉ | 178/600 [32:26<1:14:06, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0506, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1271, device='cuda:0', grad_fn=) [2024-06-18 22:47:38,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3483.48 | bwd_microstep: 1744.99 | bwd_inner_microstep: 1739.91 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8240, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8004, device='cuda:0', grad_fn=) [2024-06-18 22:47:44,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:47:44,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3570.53 | bwd_microstep: 1923.50 | bwd_inner_microstep: 1917.85 | bwd_allreduce_microstep: 5.54 | step_microstep: 62.15 [2024-06-18 22:47:44,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7053.98 | bwd: 3668.48 | bwd_inner: 3657.82 | bwd_allreduce: 10.44 | step: 62.23 30%|██▉ | 179/600 [32:37<1:14:52, 10.67s/it] {'loss': 0.4637, 'learning_rate': 8.227700711219493e-05, 'epoch': 1.79} 30%|██▉ | 179/600 [32:37<1:14:52, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0465, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1225, device='cuda:0', grad_fn=) [2024-06-18 22:47:48,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2659.27 | bwd_microstep: 1604.29 | bwd_inner_microstep: 1599.32 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0268, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1052, device='cuda:0', grad_fn=) [2024-06-18 22:47:53,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:47:53,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3486.23 | bwd_microstep: 1745.55 | bwd_inner_microstep: 1739.95 | bwd_allreduce_microstep: 5.48 | step_microstep: 62.07 [2024-06-18 22:47:53,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6145.49 | bwd: 3349.83 | bwd_inner: 3339.28 | bwd_allreduce: 10.35 | step: 62.15 30%|███ | 180/600 [32:47<1:12:44, 10.39s/it] {'loss': 0.1139, 'learning_rate': 8.2070410981557e-05, 'epoch': 1.8} 30%|███ | 180/600 [32:47<1:12:44, 10.39s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0203, device='cuda:0', grad_fn=) tensor(0.8107, 
device='cuda:0', grad_fn=) tensor(0.0994, device='cuda:0', grad_fn=) [2024-06-18 22:47:59,293] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3483.15 | bwd_microstep: 1738.85 | bwd_inner_microstep: 1733.74 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0023, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0831, device='cuda:0', grad_fn=) [2024-06-18 22:48:04,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 22:48:04,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3483.84 | bwd_microstep: 1740.05 | bwd_inner_microstep: 1734.48 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.62 [2024-06-18 22:48:04,668] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6966.97 | bwd: 3478.89 | bwd_inner: 3468.28 | bwd_allreduce: 10.41 | step: 61.70 30%|███ | 181/600 [32:57<1:13:11, 10.48s/it] {'loss': 0.0912, 'learning_rate': 8.186288039838304e-05, 'epoch': 1.81} 30%|███ | 181/600 [32:57<1:13:11, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9308, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9077, device='cuda:0', grad_fn=) [2024-06-18 22:48:10,213] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.27 | bwd_microstep: 1888.74 | bwd_inner_microstep: 1883.66 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1785, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(1.1198, device='cuda:0', grad_fn=) [2024-06-18 22:48:15,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:48:15,868] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3567.24 | bwd_microstep: 1924.97 | bwd_inner_microstep: 1919.38 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.97 [2024-06-18 22:48:15,869] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7124.50 | bwd: 3813.69 | bwd_inner: 3803.09 | bwd_allreduce: 10.40 | step: 62.05 30%|███ | 182/600 [33:09<1:14:31, 10.70s/it] {'loss': 1.0137, 'learning_rate': 8.16544214096015e-05, 'epoch': 1.82} 30%|███ | 182/600 [33:09<1:14:31, 10.70s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.4232, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3509, device='cuda:0', grad_fn=) [2024-06-18 22:48:20,768] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2955.98 | bwd_microstep: 1843.69 | 
bwd_inner_microstep: 1838.61 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0103, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0911, device='cuda:0', grad_fn=) [2024-06-18 22:48:26,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:48:26,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3509.26 | bwd_microstep: 1809.17 | bwd_inner_microstep: 1803.46 | bwd_allreduce_microstep: 5.60 | step_microstep: 62.21 [2024-06-18 22:48:26,241] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6465.23 | bwd: 3652.86 | bwd_inner: 3642.13 | bwd_allreduce: 10.50 | step: 62.29 30%|███ | 183/600 [33:19<1:13:39, 10.60s/it] {'loss': 0.721, 'learning_rate': 8.144504008919222e-05, 'epoch': 1.83} 30%|███ | 183/600 [33:19<1:13:39, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0030, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0838, device='cuda:0', grad_fn=) [2024-06-18 22:48:31,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3488.95 | bwd_microstep: 1745.45 | bwd_inner_microstep: 1740.37 | bwd_allreduce_microstep: 4.97 | step_microstep: 0.08 warning: The size of 
tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1959, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1348, device='cuda:0', grad_fn=) [2024-06-18 22:48:36,526] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:48:36,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2955.20 | bwd_microstep: 1837.00 | bwd_inner_microstep: 1831.47 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.84 [2024-06-18 22:48:36,527] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6444.13 | bwd: 3582.44 | bwd_inner: 3571.85 | bwd_allreduce: 10.39 | step: 61.92 31%|███ | 184/600 [33:29<1:12:50, 10.51s/it] {'loss': 0.6093, 'learning_rate': 8.123474253800957e-05, 'epoch': 1.84} 31%|███ | 184/600 [33:29<1:12:50, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0656, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0286, device='cuda:0', grad_fn=) [2024-06-18 22:48:40,897] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2661.75 | bwd_microstep: 1613.47 | bwd_inner_microstep: 1608.47 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3277, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(1.2648, device='cuda:0', grad_fn=) [2024-06-18 22:48:46,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:48:46,648] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3598.58 | bwd_microstep: 1984.95 | bwd_inner_microstep: 1979.37 | bwd_allreduce_microstep: 5.49 | step_microstep: 61.87 [2024-06-18 22:48:46,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6260.32 | bwd: 3598.41 | bwd_inner: 3587.85 | bwd_allreduce: 10.37 | step: 61.95 31%|███ | 185/600 [33:39<1:11:51, 10.39s/it] {'loss': 1.1467, 'learning_rate': 8.102353488360454e-05, 'epoch': 1.85} 31%|███ | 185/600 [33:39<1:11:51, 10.39s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0757, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0266, device='cuda:0', grad_fn=) [2024-06-18 22:48:52,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.67 | bwd_microstep: 1934.37 | bwd_inner_microstep: 1929.27 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1093, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(1.0575, device='cuda:0', grad_fn=) [2024-06-18 22:48:57,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:48:57,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.01 | bwd_microstep: 1904.40 | bwd_inner_microstep: 1898.83 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.85 [2024-06-18 22:48:57,894] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7136.67 | bwd: 3838.77 | bwd_inner: 3828.15 | bwd_allreduce: 10.42 | step: 61.94 31%|███ | 186/600 [33:51<1:13:27, 10.65s/it] {'loss': 1.042, 'learning_rate': 8.081142328004637e-05, 'epoch': 1.86} 31%|███ | 186/600 [33:51<1:13:27, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0263, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1048, device='cuda:0', grad_fn=) [2024-06-18 22:49:03,298] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3506.58 | bwd_microstep: 1803.66 | bwd_inner_microstep: 1798.61 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0882, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0381, device='cuda:0', grad_fn=) [2024-06-18 22:49:09,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 22:49:09,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.02 | bwd_microstep: 1977.13 | bwd_inner_microstep: 1971.27 | bwd_allreduce_microstep: 5.74 | step_microstep: 64.70 [2024-06-18 22:49:09,040] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7099.56 | bwd: 3780.79 | bwd_inner: 3769.90 | bwd_allreduce: 10.69 | step: 64.78 31%|███ | 187/600 [34:02<1:14:19, 10.80s/it] {'loss': 0.5714, 'learning_rate': 8.059841390774307e-05, 'epoch': 1.87} 31%|███ | 187/600 [34:02<1:14:19, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8724, device='cuda:0', grad_fn=) tensor(0.7072, device='cuda:0', grad_fn=) tensor(0.8558, device='cuda:0', grad_fn=) [2024-06-18 22:49:14,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.25 | bwd_microstep: 1907.31 | bwd_inner_microstep: 1902.28 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.4390, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4651, device='cuda:0', grad_fn=) [2024-06-18 22:49:20,198] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:49:20,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.35 | bwd_microstep: 1880.53 | bwd_inner_microstep: 1874.86 | bwd_allreduce_microstep: 5.56 | step_microstep: 61.80 [2024-06-18 22:49:20,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7114.58 | bwd: 3787.83 | bwd_inner: 3777.19 | bwd_allreduce: 10.43 | step: 61.89 31%|███▏ | 188/600 [34:13<1:14:52, 10.91s/it] {'loss': 0.6604, 'learning_rate': 8.038451297326145e-05, 'epoch': 1.88} 31%|███▏ | 188/600 [34:13<1:14:52, 10.91s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0689, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0205, device='cuda:0', grad_fn=) [2024-06-18 22:49:24,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2764.85 | bwd_microstep: 1811.49 | bwd_inner_microstep: 1806.38 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0981, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.0464, 
device='cuda:0', grad_fn=) [2024-06-18 22:49:30,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:49:30,550] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.59 | bwd_microstep: 1936.35 | bwd_inner_microstep: 1930.65 | bwd_allreduce_microstep: 5.59 | step_microstep: 62.95 [2024-06-18 22:49:30,551] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6333.41 | bwd: 3747.83 | bwd_inner: 3737.09 | bwd_allreduce: 10.54 | step: 63.03 32%|███▏ | 189/600 [34:23<1:13:33, 10.74s/it] {'loss': 1.0334, 'learning_rate': 8.016972670914624e-05, 'epoch': 1.89} 32%|███▏ | 189/600 [34:23<1:13:33, 10.74s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9869, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.9577, device='cuda:0', grad_fn=) [2024-06-18 22:49:36,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.69 | bwd_microstep: 1890.73 | bwd_inner_microstep: 1885.75 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9484, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9127, device='cuda:0', grad_fn=) [2024-06-18 22:49:41,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| optimizer_step: 1.85 [2024-06-18 22:49:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.76 | bwd_microstep: 1942.25 | bwd_inner_microstep: 1936.71 | bwd_allreduce_microstep: 5.43 | step_microstep: 62.19 [2024-06-18 22:49:41,782] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7129.44 | bwd: 3832.98 | bwd_inner: 3822.46 | bwd_allreduce: 10.32 | step: 62.27 32%|███▏ | 190/600 [34:34<1:14:23, 10.89s/it] {'loss': 0.9352, 'learning_rate': 7.995406137373846e-05, 'epoch': 1.9} 32%|███▏ | 190/600 [34:34<1:14:23, 10.89s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0347, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1128, device='cuda:0', grad_fn=) [2024-06-18 22:49:47,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3502.09 | bwd_microstep: 1807.68 | bwd_inner_microstep: 1802.62 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6777, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6687, device='cuda:0', grad_fn=) [2024-06-18 22:49:52,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:49:52,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 3591.50 | bwd_microstep: 1969.15 | bwd_inner_microstep: 1963.56 | bwd_allreduce_microstep: 5.40 | step_microstep: 62.11 [2024-06-18 22:49:52,915] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7093.57 | bwd: 3776.83 | bwd_inner: 3766.24 | bwd_allreduce: 10.37 | step: 62.19 32%|███▏ | 191/600 [34:46<1:14:42, 10.96s/it] {'loss': 0.3907, 'learning_rate': 7.973752325099314e-05, 'epoch': 1.91} 32%|███▏ | 191/600 [34:46<1:14:42, 10.96s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6885, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6896, device='cuda:0', grad_fn=) [2024-06-18 22:49:58,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.32 | bwd_microstep: 1912.20 | bwd_inner_microstep: 1907.20 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0041, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0847, device='cuda:0', grad_fn=) [2024-06-18 22:50:03,796] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:50:03,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3461.87 | bwd_microstep: 1690.24 | bwd_inner_microstep: 1684.46 | 
bwd_allreduce_microstep: 5.66 | step_microstep: 62.82 [2024-06-18 22:50:03,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7023.18 | bwd: 3602.43 | bwd_inner: 3591.68 | bwd_allreduce: 10.55 | step: 62.90 32%|███▏ | 192/600 [34:56<1:14:22, 10.94s/it] {'loss': 0.3872, 'learning_rate': 7.952011865029614e-05, 'epoch': 1.92} 32%|███▏ | 192/600 [34:56<1:14:22, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0343, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1123, device='cuda:0', grad_fn=) [2024-06-18 22:50:09,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3484.10 | bwd_microstep: 1742.45 | bwd_inner_microstep: 1737.40 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9322, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8981, device='cuda:0', grad_fn=) [2024-06-18 22:50:14,745] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:50:14,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.16 | bwd_microstep: 1903.13 | bwd_inner_microstep: 1897.60 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.57 [2024-06-18 22:50:14,746] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7044.24 | bwd: 3645.59 | bwd_inner: 3635.02 | bwd_allreduce: 10.36 | step: 61.66 32%|███▏ | 193/600 [35:07<1:14:12, 10.94s/it] {'loss': 0.5052, 'learning_rate': 7.930185390628035e-05, 'epoch': 1.93} 32%|███▏ | 193/600 [35:07<1:14:12, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7757, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.7680, device='cuda:0', grad_fn=) [2024-06-18 22:50:20,447] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3714.44 | bwd_microstep: 1887.42 | bwd_inner_microstep: 1882.40 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0831, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0336, device='cuda:0', grad_fn=) [2024-06-18 22:50:26,080] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:50:26,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.36 | bwd_microstep: 1907.38 | bwd_inner_microstep: 1901.74 | bwd_allreduce_microstep: 5.45 | step_microstep: 62.32 [2024-06-18 22:50:26,081] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7275.76 | bwd: 3794.80 | bwd_inner: 3784.23 | 
bwd_allreduce: 10.30 | step: 62.40 32%|███▏ | 194/600 [35:19<1:14:49, 11.06s/it] {'loss': 0.9008, 'learning_rate': 7.908273537864113e-05, 'epoch': 1.94} 32%|███▏ | 194/600 [35:19<1:14:49, 11.06s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0499, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1264, device='cuda:0', grad_fn=) [2024-06-18 22:50:30,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2844.51 | bwd_microstep: 1635.67 | bwd_inner_microstep: 1630.57 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0036, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0847, device='cuda:0', grad_fn=) [2024-06-18 22:50:36,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:50:36,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3506.44 | bwd_microstep: 1807.60 | bwd_inner_microstep: 1802.11 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.91 [2024-06-18 22:50:36,121] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6350.94 | bwd: 3443.27 | bwd_inner: 3432.69 | bwd_allreduce: 10.39 | step: 62.00 32%|███▎ | 195/600 [35:29<1:12:35, 10.75s/it] {'loss': 0.1056, 
'learning_rate': 7.886276945195099e-05, 'epoch': 1.95} 32%|███▎ | 195/600 [35:29<1:12:35, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8434, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8290, device='cuda:0', grad_fn=) [2024-06-18 22:50:41,722] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3569.41 | bwd_microstep: 1926.88 | bwd_inner_microstep: 1921.92 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8127, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7899, device='cuda:0', grad_fn=) [2024-06-18 22:50:47,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:50:47,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3572.08 | bwd_microstep: 1930.91 | bwd_inner_microstep: 1925.44 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.99 [2024-06-18 22:50:47,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7141.48 | bwd: 3857.78 | bwd_inner: 3847.37 | bwd_allreduce: 10.23 | step: 62.07 33%|███▎ | 196/600 [35:40<1:13:26, 10.91s/it] {'loss': 0.8095, 'learning_rate': 7.86419625354735e-05, 'epoch': 1.96} 33%|███▎ | 196/600 [35:40<1:13:26, 
10.91s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3966, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.3265, device='cuda:0', grad_fn=) [2024-06-18 22:50:51,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2849.71 | bwd_microstep: 1637.54 | bwd_inner_microstep: 1632.44 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0168, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9739, device='cuda:0', grad_fn=) [2024-06-18 22:50:57,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:50:57,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.84 | bwd_microstep: 1928.50 | bwd_inner_microstep: 1922.67 | bwd_allreduce_microstep: 5.66 | step_microstep: 62.49 [2024-06-18 22:50:57,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6418.53 | bwd: 3566.03 | bwd_inner: 3555.20 | bwd_allreduce: 10.58 | step: 62.58 33%|███▎ | 197/600 [35:50<1:11:55, 10.71s/it] {'loss': 1.1502, 'learning_rate': 7.842032106297666e-05, 'epoch': 1.97} 33%|███▎ | 197/600 [35:50<1:11:55, 10.71s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.4015, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.3313, device='cuda:0', grad_fn=) [2024-06-18 22:51:03,288] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.85 | bwd_microstep: 1957.93 | bwd_inner_microstep: 1952.91 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9912, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9506, device='cuda:0', grad_fn=) [2024-06-18 22:51:08,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.99 [2024-06-18 22:51:08,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.65 | bwd_microstep: 1903.95 | bwd_inner_microstep: 1898.32 | bwd_allreduce_microstep: 5.52 | step_microstep: 62.94 [2024-06-18 22:51:08,919] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7153.49 | bwd: 3861.87 | bwd_inner: 3851.25 | bwd_allreduce: 10.43 | step: 63.03 33%|███▎ | 198/600 [36:02<1:12:54, 10.88s/it] {'loss': 1.1409, 'learning_rate': 7.819785149254532e-05, 'epoch': 1.98} 33%|███▎ | 198/600 [36:02<1:12:54, 10.88s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0189, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0985, device='cuda:0', grad_fn=) [2024-06-18 22:51:13,569] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2877.15 | bwd_microstep: 1678.37 | bwd_inner_microstep: 1673.31 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0080, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0882, device='cuda:0', grad_fn=) [2024-06-18 22:51:19,041] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 22:51:19,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3508.53 | bwd_microstep: 1808.80 | bwd_inner_microstep: 1803.22 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.15 [2024-06-18 22:51:19,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6385.68 | bwd: 3487.17 | bwd_inner: 3476.54 | bwd_allreduce: 10.43 | step: 62.23 33%|███▎ | 199/600 [36:12<1:11:12, 10.65s/it] {'loss': 0.0934, 'learning_rate': 7.797456030639313e-05, 'epoch': 1.99} 33%|███▎ | 199/600 [36:12<1:11:12, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9264, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.9041, device='cuda:0', grad_fn=) [2024-06-18 22:51:24,688] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.43 | bwd_microstep: 1953.09 | bwd_inner_microstep: 1948.04 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.09 please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0924, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0416, device='cuda:0', grad_fn=) [2024-06-18 22:51:31,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:51:31,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.59 | bwd_microstep: 1929.31 | bwd_inner_microstep: 1923.65 | bwd_allreduce_microstep: 5.55 | step_microstep: 62.31 [2024-06-18 22:51:31,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7156.96 | bwd: 3882.39 | bwd_inner: 3871.74 | bwd_allreduce: 10.46 | step: 62.41 33%|███▎ | 200/600 [36:24<1:13:59, 11.10s/it] {'loss': 0.9729, 'learning_rate': 7.77504540106735e-05, 'epoch': 2.0} 
33%|███▎ | 200/600 [36:24<1:13:59, 11.10s/it][2024-06-18 22:51:33,909] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:51:39,726] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:51:45,584] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 22:51:51,393] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0171, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0969, device='cuda:0', grad_fn=) [2024-06-18 22:52:00,267] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3506.54 | bwd_microstep: 1800.31 | bwd_inner_microstep: 1795.16 | bwd_allreduce_microstep: 5.03 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3105, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2494, device='cuda:0', grad_fn=) [2024-06-18 22:52:05,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:52:05,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.41 | bwd_microstep: 
1958.98 | bwd_inner_microstep: 1953.46 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.78 [2024-06-18 22:52:05,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7081.92 | bwd: 3759.28 | bwd_inner: 3748.63 | bwd_allreduce: 10.45 | step: 61.86 34%|███▎ | 201/600 [36:59<2:01:03, 18.21s/it] {'loss': 0.6731, 'learning_rate': 7.752553913529018e-05, 'epoch': 2.01} 34%|███▎ | 201/600 [36:59<2:01:03, 18.21s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2792, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.2216, device='cuda:0', grad_fn=) [2024-06-18 22:52:11,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.06 | bwd_microstep: 1952.94 | bwd_inner_microstep: 1947.96 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0806, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1536, device='cuda:0', grad_fn=) [2024-06-18 22:52:17,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:52:17,047] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.30 | bwd_microstep: 1803.96 | bwd_inner_microstep: 1798.44 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.18 
[2024-06-18 22:52:17,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7069.32 | bwd: 3756.89 | bwd_inner: 3746.39 | bwd_allreduce: 10.34 | step: 61.26 34%|███▎ | 202/600 [37:10<1:46:35, 16.07s/it] {'loss': 0.6876, 'learning_rate': 7.729982223370691e-05, 'epoch': 2.02} 34%|███▎ | 202/600 [37:10<1:46:35, 16.07s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8793, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8614, device='cuda:0', grad_fn=) [2024-06-18 22:52:22,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.06 | bwd_microstep: 1952.29 | bwd_inner_microstep: 1947.34 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8824, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8530, device='cuda:0', grad_fn=) [2024-06-18 22:52:28,344] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:52:28,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.63 | bwd_microstep: 1937.01 | bwd_inner_microstep: 1931.22 | bwd_allreduce_microstep: 5.68 | step_microstep: 62.07 [2024-06-18 22:52:28,345] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7140.65 | bwd: 
3889.29 | bwd_inner: 3878.55 | bwd_allreduce: 10.57 | step: 62.15 34%|███▍ | 203/600 [37:21<1:36:50, 14.64s/it] {'loss': 0.8572, 'learning_rate': 7.707330988275651e-05, 'epoch': 2.03} 34%|███▍ | 203/600 [37:21<1:36:50, 14.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5465, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5614, device='cuda:0', grad_fn=) [2024-06-18 22:52:33,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.72 | bwd_microstep: 1923.14 | bwd_inner_microstep: 1918.06 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0067, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9651, device='cuda:0', grad_fn=) [2024-06-18 22:52:39,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:52:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.21 | bwd_microstep: 1926.05 | bwd_inner_microstep: 1920.65 | bwd_allreduce_microstep: 5.32 | step_microstep: 61.30 [2024-06-18 22:52:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7118.88 | bwd: 3849.18 | bwd_inner: 3838.73 | bwd_allreduce: 10.23 | step: 61.38 34%|███▍ | 204/600 [37:32<1:29:52, 
13.62s/it] {'loss': 0.7632, 'learning_rate': 7.68460086824492e-05, 'epoch': 2.04} 34%|███▍ | 204/600 [37:32<1:29:52, 13.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9830, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9547, device='cuda:0', grad_fn=) [2024-06-18 22:52:45,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.22 | bwd_microstep: 1885.22 | bwd_inner_microstep: 1880.15 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7363, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7214, device='cuda:0', grad_fn=) [2024-06-18 22:52:50,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:52:50,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.77 | bwd_microstep: 1977.34 | bwd_inner_microstep: 1971.73 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.48 [2024-06-18 22:52:50,846] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.97 | bwd: 3862.55 | bwd_inner: 3851.97 | bwd_allreduce: 10.30 | step: 61.56 34%|███▍ | 205/600 [37:44<1:24:59, 12.91s/it] {'loss': 0.8381, 'learning_rate': 7.661792525578035e-05, 'epoch': 2.05} 34%|███▍ | 205/600 
[37:44<1:24:59, 12.91s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0751, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1486, device='cuda:0', grad_fn=) [2024-06-18 22:52:56,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3469.61 | bwd_microstep: 1725.50 | bwd_inner_microstep: 1720.44 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0639, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1390, device='cuda:0', grad_fn=) [2024-06-18 22:53:01,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 22:53:01,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3503.37 | bwd_microstep: 1802.00 | bwd_inner_microstep: 1796.28 | bwd_allreduce_microstep: 5.56 | step_microstep: 61.76 [2024-06-18 22:53:01,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6972.97 | bwd: 3527.50 | bwd_inner: 3516.81 | bwd_allreduce: 10.43 | step: 61.84 34%|███▍ | 206/600 [37:54<1:20:30, 12.26s/it] {'loss': 0.1438, 'learning_rate': 7.638906624853743e-05, 'epoch': 2.06} 34%|███▍ | 206/600 [37:54<1:20:30, 12.26s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7994, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7894, device='cuda:0', grad_fn=) [2024-06-18 22:53:06,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2686.32 | bwd_microstep: 1662.55 | bwd_inner_microstep: 1657.61 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1442, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0883, device='cuda:0', grad_fn=) [2024-06-18 22:53:11,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:53:11,682] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.37 | bwd_microstep: 1925.92 | bwd_inner_microstep: 1920.32 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.67 [2024-06-18 22:53:11,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6244.68 | bwd: 3588.46 | bwd_inner: 3577.95 | bwd_allreduce: 10.32 | step: 61.76 34%|███▍ | 207/600 [38:04<1:16:02, 11.61s/it] {'loss': 0.9388, 'learning_rate': 7.61594383291065e-05, 'epoch': 2.07} 34%|███▍ | 207/600 [38:04<1:16:02, 11.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6594, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6519, device='cuda:0', grad_fn=) [2024-06-18 22:53:17,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3582.54 | bwd_microstep: 1974.49 | bwd_inner_microstep: 1969.55 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8916, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8728, device='cuda:0', grad_fn=) [2024-06-18 22:53:22,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:53:22,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.71 | bwd_microstep: 1896.92 | bwd_inner_microstep: 1891.30 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.42 [2024-06-18 22:53:22,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7140.22 | bwd: 3871.41 | bwd_inner: 3860.88 | bwd_allreduce: 10.33 | step: 61.50 35%|███▍ | 208/600 [38:16<1:15:12, 11.51s/it] {'loss': 0.7623, 'learning_rate': 7.592904818827775e-05, 'epoch': 2.08} 35%|███▍ | 208/600 [38:16<1:15:12, 11.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of 
tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9443, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9198, device='cuda:0', grad_fn=) [2024-06-18 22:53:28,512] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.49 | bwd_microstep: 1894.94 | bwd_inner_microstep: 1889.72 | bwd_allreduce_microstep: 5.11 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4992, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5189, device='cuda:0', grad_fn=) [2024-06-18 22:53:34,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:53:34,110] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.00 | bwd_microstep: 1892.35 | bwd_inner_microstep: 1886.81 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.00 [2024-06-18 22:53:34,111] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7101.47 | bwd: 3787.29 | bwd_inner: 3776.55 | bwd_allreduce: 10.55 | step: 61.09 35%|███▍ | 209/600 [38:27<1:14:18, 11.40s/it] {'loss': 0.7193, 'learning_rate': 7.569790253905059e-05, 'epoch': 2.09} 35%|███▍ | 209/600 [38:27<1:14:18, 11.40s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0343, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1119, device='cuda:0', grad_fn=) [2024-06-18 22:53:39,396] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3471.47 | bwd_microstep: 1724.73 | bwd_inner_microstep: 1719.75 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8534, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8265, device='cuda:0', grad_fn=) [2024-06-18 22:53:45,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:53:45,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.67 | bwd_microstep: 1938.75 | bwd_inner_microstep: 1933.20 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.11 [2024-06-18 22:53:45,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7037.12 | bwd: 3663.48 | bwd_inner: 3652.94 | bwd_allreduce: 10.35 | step: 61.19 35%|███▌ | 210/600 [38:38<1:13:14, 11.27s/it] {'loss': 0.4692, 'learning_rate': 7.546600811643816e-05, 'epoch': 2.1} 35%|███▌ | 210/600 [38:38<1:13:14, 11.27s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0364, device='cuda:0', grad_fn=) tensor(0.5880, 
device='cuda:0', grad_fn=) tensor(0.9916, device='cuda:0', grad_fn=) [2024-06-18 22:53:50,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3603.32 | bwd_microstep: 1991.22 | bwd_inner_microstep: 1986.04 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7262, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7123, device='cuda:0', grad_fn=) [2024-06-18 22:53:56,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:53:56,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.79 | bwd_microstep: 1975.12 | bwd_inner_microstep: 1969.64 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.48 [2024-06-18 22:53:56,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7189.08 | bwd: 3966.34 | bwd_inner: 3955.70 | bwd_allreduce: 10.44 | step: 61.56 35%|███▌ | 211/600 [38:49<1:13:22, 11.32s/it] {'loss': 0.852, 'learning_rate': 7.523337167727095e-05, 'epoch': 2.11} 35%|███▌ | 211/600 [38:49<1:13:22, 11.32s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0041, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0847, device='cuda:0', grad_fn=) [2024-06-18 22:54:01,903] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3499.32 | bwd_microstep: 1805.19 | bwd_inner_microstep: 1800.15 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8961, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8764, device='cuda:0', grad_fn=) [2024-06-18 22:54:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:54:06,488] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2704.23 | bwd_microstep: 1725.23 | bwd_inner_microstep: 1719.71 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.60 [2024-06-18 22:54:06,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6203.52 | bwd: 3530.41 | bwd_inner: 3519.88 | bwd_allreduce: 10.34 | step: 61.68 35%|███▌ | 212/600 [38:59<1:10:36, 10.92s/it] {'loss': 0.4806, 'learning_rate': 7.500000000000001e-05, 'epoch': 2.12} 35%|███▌ | 212/600 [38:59<1:10:36, 10.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9822, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9424, device='cuda:0', grad_fn=) [2024-06-18 22:54:12,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.52 | bwd_microstep: 1971.71 | 
bwd_inner_microstep: 1966.71 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0143, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9717, device='cuda:0', grad_fn=) [2024-06-18 22:54:16,931] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:54:16,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2766.51 | bwd_microstep: 1844.49 | bwd_inner_microstep: 1838.99 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.11 [2024-06-18 22:54:16,932] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6352.00 | bwd: 3816.19 | bwd_inner: 3805.70 | bwd_allreduce: 10.32 | step: 61.19 36%|███▌ | 213/600 [39:10<1:09:30, 10.78s/it] {'loss': 0.9571, 'learning_rate': 7.476589988449939e-05, 'epoch': 2.13} 36%|███▌ | 213/600 [39:10<1:09:30, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.8150, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8038, device='cuda:0', grad_fn=) [2024-06-18 22:54:21,685] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2783.10 | bwd_microstep: 1866.44 | bwd_inner_microstep: 1861.36 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size of 
tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9502, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.9248, device='cuda:0', grad_fn=) [2024-06-18 22:54:26,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:54:26,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2681.16 | bwd_microstep: 1660.69 | bwd_inner_microstep: 1655.24 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.49 [2024-06-18 22:54:26,182] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5464.25 | bwd: 3527.13 | bwd_inner: 3516.65 | bwd_allreduce: 10.26 | step: 61.57 36%|███▌ | 214/600 [39:19<1:06:22, 10.32s/it] {'loss': 0.8643, 'learning_rate': 7.453107815186803e-05, 'epoch': 2.14} 36%|███▌ | 214/600 [39:19<1:06:22, 10.32s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0153, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0944, device='cuda:0', grad_fn=) [2024-06-18 22:54:31,492] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.99 | bwd_microstep: 1739.19 | bwd_inner_microstep: 1734.18 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0314, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1098, device='cuda:0', grad_fn=) [2024-06-18 22:54:36,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:54:36,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.79 | bwd_microstep: 1745.20 | bwd_inner_microstep: 1739.63 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.33 [2024-06-18 22:54:36,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6955.77 | bwd: 3484.38 | bwd_inner: 3473.86 | bwd_allreduce: 10.29 | step: 61.41 36%|███▌ | 215/600 [39:30<1:06:54, 10.43s/it] {'loss': 0.1021, 'learning_rate': 7.429554164423102e-05, 'epoch': 2.15} 36%|███▌ | 215/600 [39:30<1:06:54, 10.43s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9767, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.9494, device='cuda:0', grad_fn=) [2024-06-18 22:54:42,520] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.32 | bwd_microstep: 1959.52 | bwd_inner_microstep: 1954.36 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6574, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6612, device='cuda:0', grad_fn=) [2024-06-18 22:54:48,124] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:54:48,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.36 | bwd_microstep: 1894.62 | bwd_inner_microstep: 1889.10 | bwd_allreduce_microstep: 5.40 | step_microstep: 62.03 [2024-06-18 22:54:48,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7141.67 | bwd: 3854.14 | bwd_inner: 3843.48 | bwd_allreduce: 10.46 | step: 62.11 36%|███▌ | 216/600 [39:41<1:08:20, 10.68s/it] {'loss': 0.8053, 'learning_rate': 7.405929722454026e-05, 'epoch': 2.16} 36%|███▌ | 216/600 [39:41<1:08:20, 10.68s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8135, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8017, device='cuda:0', grad_fn=) [2024-06-18 22:54:53,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.40 | bwd_microstep: 1886.32 | bwd_inner_microstep: 1881.22 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5875, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5987, device='cuda:0', grad_fn=) [2024-06-18 22:54:59,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:54:59,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.87 | bwd_microstep: 1886.41 | bwd_inner_microstep: 1880.98 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.44 [2024-06-18 22:54:59,251] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7098.26 | bwd: 3772.74 | bwd_inner: 3762.20 | bwd_allreduce: 10.38 | step: 61.53 36%|███▌ | 217/600 [39:52<1:09:00, 10.81s/it] {'loss': 0.7002, 'learning_rate': 7.382235177637437e-05, 'epoch': 2.17} 36%|███▌ | 217/600 [39:52<1:09:00, 10.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0026, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9609, device='cuda:0', grad_fn=) [2024-06-18 22:55:04,814] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.21 | bwd_microstep: 1904.37 | bwd_inner_microstep: 1899.30 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.9712, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9441, device='cuda:0', grad_fn=) [2024-06-18 22:55:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:55:10,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.20 | bwd_microstep: 1898.48 | bwd_inner_microstep: 1892.89 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.56 [2024-06-18 22:55:10,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7108.39 | bwd: 3802.84 | bwd_inner: 3792.18 | bwd_allreduce: 10.46 | step: 61.65 36%|███▋ | 218/600 [40:03<1:09:31, 10.92s/it] {'loss': 0.9525, 'learning_rate': 7.358471220373832e-05, 'epoch': 2.18} 36%|███▋ | 218/600 [40:03<1:09:31, 10.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0154, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0949, device='cuda:0', grad_fn=) [2024-06-18 22:55:15,815] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.28 | bwd_microstep: 1801.19 | bwd_inner_microstep: 1796.20 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9656, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9278, 
device='cuda:0', grad_fn=) [2024-06-18 22:55:21,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:55:21,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3591.95 | bwd_microstep: 1982.86 | bwd_inner_microstep: 1977.35 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.58 [2024-06-18 22:55:21,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7086.17 | bwd: 3784.04 | bwd_inner: 3773.58 | bwd_allreduce: 10.27 | step: 61.66 36%|███▋ | 219/600 [40:14<1:09:45, 10.99s/it] {'loss': 0.5114, 'learning_rate': 7.334638543086203e-05, 'epoch': 2.19} 36%|███▋ | 219/600 [40:14<1:09:45, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2068, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1446, device='cuda:0', grad_fn=) [2024-06-18 22:55:27,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3587.68 | bwd_microstep: 1975.01 | bwd_inner_microstep: 1970.06 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.2141, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.1508, device='cuda:0', grad_fn=) [2024-06-18 22:55:32,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 1.87 [2024-06-18 22:55:32,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2769.44 | bwd_microstep: 1840.76 | bwd_inner_microstep: 1835.04 | bwd_allreduce_microstep: 5.60 | step_microstep: 62.18 [2024-06-18 22:55:32,021] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6357.08 | bwd: 3815.76 | bwd_inner: 3805.11 | bwd_allreduce: 10.44 | step: 62.26 37%|███▋ | 220/600 [40:25<1:08:34, 10.83s/it] {'loss': 1.1477, 'learning_rate': 7.310737840199885e-05, 'epoch': 2.2} 37%|███▋ | 220/600 [40:25<1:08:34, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2082, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1573, device='cuda:0', grad_fn=) [2024-06-18 22:55:37,660] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.42 | bwd_microstep: 1956.41 | bwd_inner_microstep: 1951.32 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.8087, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7863, device='cuda:0', grad_fn=) [2024-06-18 22:55:42,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 22:55:42,649] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2955.47 | bwd_microstep: 1866.48 | bwd_inner_microstep: 1860.74 | bwd_allreduce_microstep: 5.58 | step_microstep: 62.57 [2024-06-18 22:55:42,650] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6533.88 | bwd: 3822.89 | bwd_inner: 3812.15 | bwd_allreduce: 10.52 | step: 62.65 37%|███▋ | 221/600 [40:35<1:08:00, 10.77s/it] {'loss': 0.9718, 'learning_rate': 7.286769808122304e-05, 'epoch': 2.21} 37%|███▋ | 221/600 [40:35<1:08:00, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0025, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0841, device='cuda:0', grad_fn=) [2024-06-18 22:55:47,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.55 | bwd_microstep: 1745.66 | bwd_inner_microstep: 1740.62 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8749, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8573, device='cuda:0', grad_fn=) [2024-06-18 22:55:53,101] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 22:55:53,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3194.70 | bwd_microstep: 1783.41 | bwd_inner_microstep: 1777.73 | 
bwd_allreduce_microstep: 5.50 | step_microstep: 62.02 [2024-06-18 22:55:53,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6671.24 | bwd: 3529.05 | bwd_inner: 3518.44 | bwd_allreduce: 10.33 | step: 62.10 37%|███▋ | 222/600 [40:46<1:07:14, 10.67s/it] {'loss': 0.4707, 'learning_rate': 7.262735145222696e-05, 'epoch': 2.22} 37%|███▋ | 222/600 [40:46<1:07:14, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6584, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6625, device='cuda:0', grad_fn=) [2024-06-18 22:55:58,737] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.20 | bwd_microstep: 1955.08 | bwd_inner_microstep: 1950.01 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0078, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0885, device='cuda:0', grad_fn=) [2024-06-18 22:56:04,190] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:56:04,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.46 | bwd_microstep: 1804.46 | bwd_inner_microstep: 1798.98 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.76 [2024-06-18 22:56:04,191] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7073.63 | bwd: 3759.53 | bwd_inner: 3749.04 | bwd_allreduce: 10.27 | step: 61.84 37%|███▋ | 223/600 [40:57<1:07:50, 10.80s/it] {'loss': 0.3755, 'learning_rate': 7.238634551811749e-05, 'epoch': 2.23} 37%|███▋ | 223/600 [40:57<1:07:50, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8674, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8506, device='cuda:0', grad_fn=) [2024-06-18 22:56:09,721] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.86 | bwd_microstep: 1887.53 | bwd_inner_microstep: 1882.48 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0525, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1287, device='cuda:0', grad_fn=) [2024-06-18 22:56:15,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:56:15,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.05 | bwd_microstep: 1803.23 | bwd_inner_microstep: 1797.60 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.36 [2024-06-18 22:56:15,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7041.87 | bwd: 3690.75 | bwd_inner: 3680.17 | 
bwd_allreduce: 10.33 | step: 61.44 37%|███▋ | 224/600 [41:08<1:08:00, 10.85s/it] {'loss': 0.4897, 'learning_rate': 7.214468730121208e-05, 'epoch': 2.24} 37%|███▋ | 224/600 [41:08<1:08:00, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2877, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2289, device='cuda:0', grad_fn=) [2024-06-18 22:56:20,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3592.64 | bwd_microstep: 1994.27 | bwd_inner_microstep: 1989.21 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7994, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7779, device='cuda:0', grad_fn=) [2024-06-18 22:56:26,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:56:26,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3583.77 | bwd_microstep: 1972.04 | bwd_inner_microstep: 1966.33 | bwd_allreduce_microstep: 5.59 | step_microstep: 61.78 [2024-06-18 22:56:26,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7176.39 | bwd: 3966.31 | bwd_inner: 3955.60 | bwd_allreduce: 10.47 | step: 61.86 38%|███▊ | 225/600 [41:19<1:08:53, 11.02s/it] {'loss': 1.0034, 
'learning_rate': 7.190238384283412e-05, 'epoch': 2.25} 38%|███▊ | 225/600 [41:19<1:08:53, 11.02s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0275, device='cuda:0', grad_fn=) tensor(0.7072, device='cuda:0', grad_fn=) tensor(0.9955, device='cuda:0', grad_fn=) [2024-06-18 22:56:32,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.89 | bwd_microstep: 1960.74 | bwd_inner_microstep: 1955.70 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1158, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0627, device='cuda:0', grad_fn=) [2024-06-18 22:56:37,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:56:37,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.04 | bwd_microstep: 1926.28 | bwd_inner_microstep: 1920.66 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.60 [2024-06-18 22:56:37,885] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.91 | bwd: 3887.02 | bwd_inner: 3876.44 | bwd_allreduce: 10.31 | step: 61.68 38%|███▊ | 226/600 [41:31<1:09:12, 11.10s/it] {'loss': 1.0291, 'learning_rate': 7.165944220310767e-05, 'epoch': 2.26} 38%|███▊ | 226/600 [41:31<1:09:12, 
11.10s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.0253, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9928, device='cuda:0', grad_fn=) [2024-06-18 22:56:40,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1693.81 | bwd_microstep: 828.68 | bwd_inner_microstep: 823.61 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5151, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5336, device='cuda:0', grad_fn=) [2024-06-18 22:56:46,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 22:56:46,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.20 | bwd_microstep: 1890.82 | bwd_inner_microstep: 1885.17 | bwd_allreduce_microstep: 5.53 | step_microstep: 62.99 [2024-06-18 22:56:46,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5239.99 | bwd: 2719.49 | bwd_inner: 2708.84 | bwd_allreduce: 10.40 | step: 63.07 38%|███▊ | 227/600 [41:39<1:03:35, 10.23s/it] {'loss': 0.7632, 'learning_rate': 7.141586946075183e-05, 'epoch': 2.27} 38%|███▊ | 227/600 [41:39<1:03:35, 10.23s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0087, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0894, device='cuda:0', grad_fn=) [2024-06-18 22:56:51,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.42 | bwd_microstep: 1743.45 | bwd_inner_microstep: 1738.40 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8368, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8231, device='cuda:0', grad_fn=) [2024-06-18 22:56:57,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 22:56:57,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.86 | bwd_microstep: 1922.63 | bwd_inner_microstep: 1917.07 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.27 [2024-06-18 22:56:57,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7037.26 | bwd: 3666.07 | bwd_inner: 3655.53 | bwd_allreduce: 10.35 | step: 61.35 38%|███▊ | 228/600 [41:50<1:04:46, 10.45s/it] {'loss': 0.4562, 'learning_rate': 7.117167271287453e-05, 'epoch': 2.28} 38%|███▊ | 228/600 [41:50<1:04:46, 10.45s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9268, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9041, device='cuda:0', grad_fn=) [2024-06-18 22:57:02,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.96 | bwd_microstep: 1914.77 | bwd_inner_microstep: 1909.45 | bwd_allreduce_microstep: 5.20 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9707, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9321, device='cuda:0', grad_fn=) [2024-06-18 22:57:08,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 22:57:08,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.56 | bwd_microstep: 1906.02 | bwd_inner_microstep: 1900.46 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.10 [2024-06-18 22:57:08,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7110.51 | bwd: 3820.78 | bwd_inner: 3809.97 | bwd_allreduce: 10.58 | step: 61.18 38%|███▊ | 229/600 [42:01<1:05:59, 10.67s/it] {'loss': 0.9181, 'learning_rate': 7.092685907476558e-05, 'epoch': 2.29} 38%|███▊ | 229/600 [42:01<1:05:59, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0470, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1238, device='cuda:0', grad_fn=) [2024-06-18 22:57:13,623] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3503.42 | bwd_microstep: 1801.71 | bwd_inner_microstep: 1796.54 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1037, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1744, device='cuda:0', grad_fn=) [2024-06-18 22:57:19,072] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 22:57:19,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.04 | bwd_microstep: 1802.18 | bwd_inner_microstep: 1796.65 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.55 [2024-06-18 22:57:19,073] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7000.43 | bwd: 3603.88 | bwd_inner: 3593.24 | bwd_allreduce: 10.41 | step: 61.65 38%|███▊ | 230/600 [42:12<1:06:08, 10.72s/it] {'loss': 0.1491, 'learning_rate': 7.068143567968957e-05, 'epoch': 2.3} 38%|███▊ | 230/600 [42:12<1:06:08, 10.72s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.9043, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8842, device='cuda:0', grad_fn=) [2024-06-18 22:57:23,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3142.44 | bwd_microstep: 1667.56 | bwd_inner_microstep: 1662.46 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0571, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1329, device='cuda:0', grad_fn=) [2024-06-18 22:57:27,794] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:57:27,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2385.05 | bwd_microstep: 1294.60 | bwd_inner_microstep: 1289.05 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.78 [2024-06-18 22:57:27,795] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5527.47 | bwd: 2962.16 | bwd_inner: 2951.56 | bwd_allreduce: 10.38 | step: 61.87 38%|███▊ | 231/600 [42:20<1:02:15, 10.12s/it] {'loss': 0.5085, 'learning_rate': 7.043540967867782e-05, 'epoch': 2.31} 38%|███▊ | 231/600 [42:20<1:02:15, 10.12s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0165, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0963, 
device='cuda:0', grad_fn=) [2024-06-18 22:57:33,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3472.05 | bwd_microstep: 1738.82 | bwd_inner_microstep: 1733.60 | bwd_allreduce_microstep: 5.04 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9231, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8895, device='cuda:0', grad_fn=) [2024-06-18 22:57:38,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:57:38,725] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.07 | bwd_microstep: 1908.89 | bwd_inner_microstep: 1903.46 | bwd_allreduce_microstep: 5.32 | step_microstep: 61.29 [2024-06-18 22:57:38,726] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7026.10 | bwd: 3647.70 | bwd_inner: 3637.11 | bwd_allreduce: 10.35 | step: 61.44 39%|███▊ | 232/600 [42:31<1:03:34, 10.37s/it] {'loss': 0.4929, 'learning_rate': 7.018878824032009e-05, 'epoch': 2.32} 39%|███▊ | 232/600 [42:31<1:03:34, 10.37s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0159, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9843, device='cuda:0', grad_fn=) [2024-06-18 22:57:43,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 2681.08 | bwd_microstep: 1660.56 | bwd_inner_microstep: 1655.45 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9666, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9399, device='cuda:0', grad_fn=) [2024-06-18 22:57:48,780] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 22:57:48,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.79 | bwd_microstep: 1908.97 | bwd_inner_microstep: 1903.37 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.18 [2024-06-18 22:57:48,781] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6230.86 | bwd: 3569.53 | bwd_inner: 3558.86 | bwd_allreduce: 10.43 | step: 61.27 39%|███▉ | 233/600 [42:41<1:02:50, 10.27s/it] {'loss': 0.9621, 'learning_rate': 6.994157855055576e-05, 'epoch': 2.33} 39%|███▉ | 233/600 [42:41<1:02:50, 10.27s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0198, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0989, device='cuda:0', grad_fn=) [2024-06-18 22:57:54,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.22 | bwd_microstep: 1740.84 | bwd_inner_microstep: 1735.78 | 
bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7338, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7304, device='cuda:0', grad_fn=) [2024-06-18 22:57:59,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:57:59,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.20 | bwd_microstep: 1922.40 | bwd_inner_microstep: 1916.89 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.07 [2024-06-18 22:57:59,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7034.38 | bwd: 3663.24 | bwd_inner: 3652.71 | bwd_allreduce: 10.31 | step: 61.15 39%|███▉ | 234/600 [42:52<1:03:54, 10.48s/it] {'loss': 0.4147, 'learning_rate': 6.969378781246436e-05, 'epoch': 2.34} 39%|███▉ | 234/600 [42:52<1:03:54, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0129, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9816, device='cuda:0', grad_fn=) [2024-06-18 22:58:05,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3590.04 | bwd_microstep: 1974.27 | bwd_inner_microstep: 1969.09 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.08 warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9637, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9258, device='cuda:0', grad_fn=) [2024-06-18 22:58:11,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 22:58:11,066] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.55 | bwd_microstep: 1931.85 | bwd_inner_microstep: 1926.25 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.68 [2024-06-18 22:58:11,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7153.56 | bwd: 3906.11 | bwd_inner: 3895.40 | bwd_allreduce: 10.46 | step: 61.76 39%|███▉ | 235/600 [43:04<1:05:17, 10.73s/it] {'loss': 0.9537, 'learning_rate': 6.944542324605578e-05, 'epoch': 2.35} 39%|███▉ | 235/600 [43:04<1:05:17, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0141, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0941, device='cuda:0', grad_fn=) [2024-06-18 22:58:16,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3496.63 | bwd_microstep: 1803.12 | bwd_inner_microstep: 1798.13 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6863, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6761, device='cuda:0', grad_fn=) [2024-06-18 22:58:22,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 22:58:22,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.18 | bwd_microstep: 1907.06 | bwd_inner_microstep: 1901.37 | bwd_allreduce_microstep: 5.59 | step_microstep: 62.00 [2024-06-18 22:58:22,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7055.80 | bwd: 3710.18 | bwd_inner: 3699.50 | bwd_allreduce: 10.51 | step: 62.09 39%|███▉ | 236/600 [43:15<1:05:43, 10.84s/it] {'loss': 0.3851, 'learning_rate': 6.919649208805981e-05, 'epoch': 2.36} 39%|███▉ | 236/600 [43:15<1:05:43, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7005, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7004, device='cuda:0', grad_fn=) [2024-06-18 22:58:27,683] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.91 | bwd_microstep: 1890.15 | bwd_inner_microstep: 1885.13 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size 
of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5469, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5622, device='cuda:0', grad_fn=) [2024-06-18 22:58:33,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:58:33,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.00 | bwd_microstep: 1898.81 | bwd_inner_microstep: 1893.27 | bwd_allreduce_microstep: 5.44 | step_microstep: 60.96 [2024-06-18 22:58:33,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7107.87 | bwd: 3788.95 | bwd_inner: 3778.39 | bwd_allreduce: 10.38 | step: 61.04 40%|███▉ | 237/600 [43:26<1:06:08, 10.93s/it] {'loss': 0.6313, 'learning_rate': 6.894700159171534e-05, 'epoch': 2.37} 40%|███▉ | 237/600 [43:26<1:06:08, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8880, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8692, device='cuda:0', grad_fn=) [2024-06-18 22:58:37,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2694.29 | bwd_microstep: 1660.94 | bwd_inner_microstep: 1655.92 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1280, 6144]) tensor(1.2800, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.2220, device='cuda:0', grad_fn=) [2024-06-18 22:58:41,574] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:58:41,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2269.90 | bwd_microstep: 1409.45 | bwd_inner_microstep: 1403.77 | bwd_allreduce_microstep: 5.57 | step_microstep: 61.58 [2024-06-18 22:58:41,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4964.17 | bwd: 3070.38 | bwd_inner: 3059.71 | bwd_allreduce: 10.49 | step: 61.67 40%|███▉ | 238/600 [43:34<1:01:09, 10.14s/it] {'loss': 1.0456, 'learning_rate': 6.869695902655897e-05, 'epoch': 2.38} 40%|███▉ | 238/600 [43:34<1:01:09, 10.14s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8327, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8194, device='cuda:0', grad_fn=) [2024-06-18 22:58:47,114] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.02 | bwd_microstep: 1893.98 | bwd_inner_microstep: 1888.88 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8259, device='cuda:0', grad_fn=) tensor(0.5847, 
device='cuda:0', grad_fn=) tensor(0.8018, device='cuda:0', grad_fn=) [2024-06-18 22:58:52,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 22:58:52,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2955.52 | bwd_microstep: 1866.96 | bwd_inner_microstep: 1861.41 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.36 [2024-06-18 22:58:52,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6500.53 | bwd: 3760.93 | bwd_inner: 3750.34 | bwd_allreduce: 10.36 | step: 61.45 40%|███▉ | 239/600 [43:45<1:01:41, 10.25s/it] {'loss': 0.8106, 'learning_rate': 6.844637167821326e-05, 'epoch': 2.39} 40%|███▉ | 239/600 [43:45<1:01:41, 10.25s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0148, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0948, device='cuda:0', grad_fn=) [2024-06-18 22:58:56,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2845.17 | bwd_microstep: 1631.12 | bwd_inner_microstep: 1626.00 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2448, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1903, device='cuda:0', grad_fn=) [2024-06-18 22:59:02,362] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 22:59:02,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.84 | bwd_microstep: 1953.31 | bwd_inner_microstep: 1947.68 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.06 [2024-06-18 22:59:02,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6420.99 | bwd: 3584.43 | bwd_inner: 3573.71 | bwd_allreduce: 10.52 | step: 62.15 40%|████ | 240/600 [43:55<1:01:31, 10.26s/it] {'loss': 0.6425, 'learning_rate': 6.819524684817438e-05, 'epoch': 2.4} 40%|████ | 240/600 [43:55<1:01:31, 10.26s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3446, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3801, device='cuda:0', grad_fn=) [2024-06-18 22:59:07,926] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.69 | bwd_microstep: 1907.88 | bwd_inner_microstep: 1902.76 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8222, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7984, device='cuda:0', grad_fn=) [2024-06-18 22:59:13,593] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 22:59:13,593] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.72 | bwd_microstep: 1939.02 | bwd_inner_microstep: 1933.46 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.35 [2024-06-18 22:59:13,594] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7116.40 | bwd: 3846.90 | bwd_inner: 3836.31 | bwd_allreduce: 10.32 | step: 61.43 40%|████ | 241/600 [44:06<1:03:06, 10.55s/it] {'loss': 0.5893, 'learning_rate': 6.794359185359938e-05, 'epoch': 2.41} 40%|████ | 241/600 [44:06<1:03:06, 10.55s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0052, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0862, device='cuda:0', grad_fn=) [2024-06-18 22:59:18,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2840.57 | bwd_microstep: 1629.47 | bwd_inner_microstep: 1624.39 | bwd_allreduce_microstep: 4.97 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.6630, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6667, device='cuda:0', grad_fn=) [2024-06-18 22:59:22,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 22:59:22,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2744.75 | bwd_microstep: 1797.57 | 
bwd_inner_microstep: 1791.95 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.49 [2024-06-18 22:59:22,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5585.31 | bwd: 3427.03 | bwd_inner: 3416.40 | bwd_allreduce: 10.38 | step: 61.57 40%|████ | 242/600 [44:16<1:00:37, 10.16s/it] {'loss': 0.3764, 'learning_rate': 6.769141402709305e-05, 'epoch': 2.42} 40%|████ | 242/600 [44:16<1:00:37, 10.16s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7738, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7667, device='cuda:0', grad_fn=) [2024-06-18 22:59:28,392] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.37 | bwd_microstep: 1894.43 | bwd_inner_microstep: 1889.31 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0218, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9784, device='cuda:0', grad_fn=) [2024-06-18 22:59:34,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:59:34,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3587.38 | bwd_microstep: 1980.57 | bwd_inner_microstep: 1975.00 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.39 [2024-06-18 
22:59:34,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.73 | bwd: 3874.99 | bwd_inner: 3864.40 | bwd_allreduce: 10.32 | step: 61.48 40%|████ | 243/600 [44:27<1:02:27, 10.50s/it] {'loss': 0.8726, 'learning_rate': 6.743872071649411e-05, 'epoch': 2.43} 40%|████ | 243/600 [44:27<1:02:27, 10.50s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8289, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8160, device='cuda:0', grad_fn=) [2024-06-18 22:59:38,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2655.30 | bwd_microstep: 1614.74 | bwd_inner_microstep: 1609.69 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0064, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0872, device='cuda:0', grad_fn=) [2024-06-18 22:59:43,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 22:59:43,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3443.57 | bwd_microstep: 1692.87 | bwd_inner_microstep: 1687.24 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.00 [2024-06-18 22:59:43,778] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6098.85 | bwd: 3307.60 | 
bwd_inner: 3297.01 | bwd_allreduce: 10.32 | step: 61.08 41%|████ | 244/600 [44:36<1:00:46, 10.24s/it] {'loss': 0.4516, 'learning_rate': 6.718551928466132e-05, 'epoch': 2.44} 41%|████ | 244/600 [44:36<1:00:46, 10.24s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0390, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1166, device='cuda:0', grad_fn=) [2024-06-18 22:59:48,399] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2865.95 | bwd_microstep: 1664.48 | bwd_inner_microstep: 1659.33 | bwd_allreduce_microstep: 5.04 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1097, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1802, device='cuda:0', grad_fn=) [2024-06-18 22:59:53,775] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 22:59:53,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.46 | bwd_microstep: 1745.91 | bwd_inner_microstep: 1740.36 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.72 [2024-06-18 22:59:53,776] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6343.39 | bwd: 3410.38 | bwd_inner: 3399.73 | bwd_allreduce: 10.39 | step: 61.80 41%|████ | 245/600 [44:46<1:00:09, 10.17s/it] 
{'loss': 0.1484, 'learning_rate': 6.693181710925878e-05, 'epoch': 2.45} 41%|████ | 245/600 [44:46<1:00:09, 10.17s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8801, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8505, device='cuda:0', grad_fn=) [2024-06-18 22:59:59,390] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.70 | bwd_microstep: 1941.99 | bwd_inner_microstep: 1936.88 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0627, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1383, device='cuda:0', grad_fn=) [2024-06-18 23:00:04,674] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:00:04,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3443.51 | bwd_microstep: 1693.11 | bwd_inner_microstep: 1687.56 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.30 [2024-06-18 23:00:04,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7007.18 | bwd: 3635.09 | bwd_inner: 3624.49 | bwd_allreduce: 10.36 | step: 61.37 41%|████ | 246/600 [44:57<1:01:17, 10.39s/it] {'loss': 0.4944, 'learning_rate': 6.667762158254104e-05, 'epoch': 2.46} 41%|████ | 246/600 
[44:57<1:01:17, 10.39s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0233, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1029, device='cuda:0', grad_fn=) [2024-06-18 23:00:09,991] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.26 | bwd_microstep: 1745.14 | bwd_inner_microstep: 1740.04 | bwd_allreduce_microstep: 4.99 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9970, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9558, device='cuda:0', grad_fn=) [2024-06-18 23:00:15,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:00:15,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.28 | bwd_microstep: 1977.05 | bwd_inner_microstep: 1971.29 | bwd_allreduce_microstep: 5.64 | step_microstep: 62.46 [2024-06-18 23:00:15,724] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7062.50 | bwd: 3722.18 | bwd_inner: 3711.34 | bwd_allreduce: 10.64 | step: 62.54 41%|████ | 247/600 [45:08<1:02:16, 10.59s/it] {'loss': 0.5293, 'learning_rate': 6.642294011113764e-05, 'epoch': 2.47} 41%|████ | 247/600 [45:08<1:02:16, 10.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9088, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8879, device='cuda:0', grad_fn=) [2024-06-18 23:00:21,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.91 | bwd_microstep: 1907.16 | bwd_inner_microstep: 1902.09 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7551, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7499, device='cuda:0', grad_fn=) [2024-06-18 23:00:26,977] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:00:26,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.78 | bwd_microstep: 1951.73 | bwd_inner_microstep: 1946.11 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.14 [2024-06-18 23:00:26,978] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7131.68 | bwd: 3858.89 | bwd_inner: 3848.29 | bwd_allreduce: 10.33 | step: 61.22 41%|████▏ | 248/600 [45:20<1:03:16, 10.79s/it] {'loss': 0.8189, 'learning_rate': 6.616778011583743e-05, 'epoch': 2.48} 41%|████▏ | 248/600 [45:20<1:03:16, 10.79s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0177, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0974, device='cuda:0', grad_fn=) [2024-06-18 23:00:29,559] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1697.73 | bwd_microstep: 821.82 | bwd_inner_microstep: 816.86 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6204, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6282, device='cuda:0', grad_fn=) [2024-06-18 23:00:35,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:00:35,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.43 | bwd_microstep: 1893.81 | bwd_inner_microstep: 1888.31 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.08 [2024-06-18 23:00:35,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5242.14 | bwd: 2715.63 | bwd_inner: 2705.18 | bwd_allreduce: 10.26 | step: 61.16 42%|████▏ | 249/600 [45:28<58:31, 10.00s/it] {'loss': 0.3628, 'learning_rate': 6.59121490313722e-05, 'epoch': 2.49} 42%|████▏ | 249/600 [45:28<58:31, 10.00s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5719, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5850, device='cuda:0', grad_fn=) [2024-06-18 23:00:40,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.55 | bwd_microstep: 1890.58 | bwd_inner_microstep: 1885.41 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1597, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1022, device='cuda:0', grad_fn=) [2024-06-18 23:00:46,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 23:00:46,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3587.32 | bwd_microstep: 1982.05 | bwd_inner_microstep: 1976.47 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.39 [2024-06-18 23:00:46,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7136.86 | bwd: 3872.62 | bwd_inner: 3861.89 | bwd_allreduce: 10.52 | step: 62.48 42%|████▏ | 250/600 [45:39<1:00:35, 10.39s/it] {'loss': 0.8436, 'learning_rate': 6.565605430620013e-05, 'epoch': 2.5} 42%|████▏ | 250/600 [45:39<1:00:35, 10.39s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8311, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8180, device='cuda:0', grad_fn=) [2024-06-18 23:00:51,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.77 | bwd_microstep: 1890.75 | bwd_inner_microstep: 1885.72 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2143, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.1517, device='cuda:0', grad_fn=) [2024-06-18 23:00:57,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:00:57,741] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3595.84 | bwd_microstep: 2001.62 | bwd_inner_microstep: 1996.06 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.44 [2024-06-18 23:00:57,742] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7145.59 | bwd: 3892.36 | bwd_inner: 3881.81 | bwd_allreduce: 10.33 | step: 61.52 42%|████▏ | 251/600 [45:50<1:02:01, 10.66s/it] {'loss': 0.9848, 'learning_rate': 6.539950340228877e-05, 'epoch': 2.51} 42%|████▏ | 251/600 [45:50<1:02:01, 10.66s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0375, device='cuda:0', grad_fn=) tensor(0.8148, 
device='cuda:0', grad_fn=) tensor(0.1153, device='cuda:0', grad_fn=) [2024-06-18 23:01:02,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3415.11 | bwd_microstep: 1639.39 | bwd_inner_microstep: 1634.33 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6895, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6905, device='cuda:0', grad_fn=) [2024-06-18 23:01:07,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:01:07,598] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2871.37 | bwd_microstep: 1685.96 | bwd_inner_microstep: 1680.40 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.08 [2024-06-18 23:01:07,599] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6286.47 | bwd: 3325.34 | bwd_inner: 3314.78 | bwd_allreduce: 10.32 | step: 61.16 42%|████▏ | 252/600 [46:00<1:00:26, 10.42s/it] {'loss': 0.4029, 'learning_rate': 6.514250379489753e-05, 'epoch': 2.52} 42%|████▏ | 252/600 [46:00<1:00:26, 10.42s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.1432, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.2103, device='cuda:0', grad_fn=) [2024-06-18 23:01:09,610] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1452.41 | bwd_microstep: 507.07 | bwd_inner_microstep: 502.13 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8220, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7986, device='cuda:0', grad_fn=) [2024-06-18 23:01:15,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:01:15,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.16 | bwd_microstep: 1935.31 | bwd_inner_microstep: 1929.48 | bwd_allreduce_microstep: 5.72 | step_microstep: 61.58 [2024-06-18 23:01:15,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5005.54 | bwd: 2442.38 | bwd_inner: 2431.63 | bwd_allreduce: 10.56 | step: 61.66 42%|████▏ | 253/600 [46:08<55:29, 9.59s/it] {'loss': 0.5044, 'learning_rate': 6.488506297236003e-05, 'epoch': 2.53} 42%|████▏ | 253/600 [46:08<55:29, 9.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0114, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0922, device='cuda:0', grad_fn=) [2024-06-18 23:01:20,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3443.94 | bwd_microstep: 1693.03 | 
bwd_inner_microstep: 1688.13 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0255, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1044, device='cuda:0', grad_fn=) [2024-06-18 23:01:25,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:01:25,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.04 | bwd_microstep: 1807.20 | bwd_inner_microstep: 1801.66 | bwd_allreduce_microstep: 5.43 | step_microstep: 62.40 [2024-06-18 23:01:25,943] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6937.95 | bwd: 3500.23 | bwd_inner: 3489.79 | bwd_allreduce: 10.27 | step: 62.48 42%|████▏ | 254/600 [46:19<57:12, 9.92s/it] {'loss': 0.0983, 'learning_rate': 6.462718843586571e-05, 'epoch': 2.54} 42%|████▏ | 254/600 [46:19<57:12, 9.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0243, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.1038, device='cuda:0', grad_fn=) [2024-06-18 23:01:31,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3478.42 | bwd_microstep: 1746.53 | bwd_inner_microstep: 1741.40 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8730, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8445, device='cuda:0', grad_fn=) [2024-06-18 23:01:36,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 23:01:36,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.83 | bwd_microstep: 1929.47 | bwd_inner_microstep: 1923.89 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.88 [2024-06-18 23:01:36,918] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7038.24 | bwd: 3675.99 | bwd_inner: 3665.30 | bwd_allreduce: 10.49 | step: 61.97 42%|████▎ | 255/600 [46:30<58:51, 10.24s/it] {'loss': 0.4742, 'learning_rate': 6.436888769924142e-05, 'epoch': 2.55} 42%|████▎ | 255/600 [46:30<58:51, 10.24s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0857, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0356, device='cuda:0', grad_fn=) [2024-06-18 23:01:42,577] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.08 | bwd_microstep: 1971.18 | bwd_inner_microstep: 1966.13 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6984, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6874, device='cuda:0', grad_fn=) [2024-06-18 23:01:47,556] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:01:47,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2952.75 | bwd_microstep: 1861.41 | bwd_inner_microstep: 1855.75 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.41 [2024-06-18 23:01:47,557] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6531.81 | bwd: 3832.59 | bwd_inner: 3821.97 | bwd_allreduce: 10.33 | step: 61.49 43%|████▎ | 256/600 [46:40<59:22, 10.36s/it] {'loss': 0.8615, 'learning_rate': 6.411016828873239e-05, 'epoch': 2.56} 43%|████▎ | 256/600 [46:40<59:22, 10.36s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2016, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.1399, device='cuda:0', grad_fn=) [2024-06-18 23:01:53,115] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.64 | bwd_microstep: 1904.31 | bwd_inner_microstep: 1899.38 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8990, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8676, device='cuda:0', grad_fn=) [2024-06-18 23:01:58,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:01:58,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.41 | bwd_microstep: 1979.31 | bwd_inner_microstep: 1973.76 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.43 [2024-06-18 23:01:58,851] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7139.04 | bwd: 3883.61 | bwd_inner: 3873.15 | bwd_allreduce: 10.28 | step: 61.51 43%|████▎ | 257/600 [46:52<1:00:48, 10.64s/it] {'loss': 1.0037, 'learning_rate': 6.385103774278303e-05, 'epoch': 2.57} 43%|████▎ | 257/600 [46:52<1:00:48, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0158, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0952, device='cuda:0', grad_fn=) [2024-06-18 23:02:04,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.48 | bwd_microstep: 1808.95 | bwd_inner_microstep: 1803.95 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6407, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6466, device='cuda:0', grad_fn=) [2024-06-18 23:02:09,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:02:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.36 | bwd_microstep: 1891.15 | bwd_inner_microstep: 1885.61 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.30 [2024-06-18 23:02:09,840] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7038.78 | bwd: 3700.09 | bwd_inner: 3689.62 | bwd_allreduce: 10.24 | step: 61.38 43%|████▎ | 258/600 [47:02<1:01:14, 10.74s/it] {'loss': 0.3709, 'learning_rate': 6.359150361181715e-05, 'epoch': 2.58} 43%|████▎ | 258/600 [47:02<1:01:14, 10.74s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8129, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8016, device='cuda:0', grad_fn=) [2024-06-18 23:02:15,418] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.96 | bwd_microstep: 1918.91 | bwd_inner_microstep: 1913.72 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.6808, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6830, device='cuda:0', grad_fn=) [2024-06-18 23:02:20,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:02:20,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2862.15 | bwd_microstep: 1669.78 | bwd_inner_microstep: 1664.23 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.99 [2024-06-18 23:02:20,104] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6419.10 | bwd: 3588.70 | bwd_inner: 3577.97 | bwd_allreduce: 10.53 | step: 62.09 43%|████▎ | 259/600 [47:13<1:00:14, 10.60s/it] {'loss': 0.7423, 'learning_rate': 6.333157345801809e-05, 'epoch': 2.59} 43%|████▎ | 259/600 [47:13<1:00:14, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(1.1154, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0738, device='cuda:0', grad_fn=) [2024-06-18 23:02:24,972] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2940.52 | bwd_microstep: 1827.93 | bwd_inner_microstep: 1823.00 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0096, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0897, 
device='cuda:0', grad_fn=) [2024-06-18 23:02:29,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:02:29,459] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2679.66 | bwd_microstep: 1656.05 | bwd_inner_microstep: 1650.44 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.62 [2024-06-18 23:02:29,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5620.17 | bwd: 3483.97 | bwd_inner: 3473.48 | bwd_allreduce: 10.28 | step: 61.71 43%|████▎ | 260/600 [47:22<57:56, 10.23s/it] {'loss': 0.5818, 'learning_rate': 6.307125485510828e-05, 'epoch': 2.6} 43%|████▎ | 260/600 [47:22<57:56, 10.23s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0091, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0896, device='cuda:0', grad_fn=) [2024-06-18 23:02:34,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.49 | bwd_microstep: 1807.22 | bwd_inner_microstep: 1801.93 | bwd_allreduce_microstep: 5.18 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.7987, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.7892, device='cuda:0', grad_fn=) [2024-06-18 23:02:39,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 1.84 [2024-06-18 23:02:39,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2650.91 | bwd_microstep: 1617.55 | bwd_inner_microstep: 1611.99 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.22 [2024-06-18 23:02:39,282] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6148.38 | bwd: 3424.76 | bwd_inner: 3413.97 | bwd_allreduce: 10.61 | step: 61.37 44%|████▎ | 261/600 [47:32<57:05, 10.11s/it] {'loss': 0.4394, 'learning_rate': 6.281055538812861e-05, 'epoch': 2.61} 44%|████▎ | 261/600 [47:32<57:05, 10.11s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8370, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8233, device='cuda:0', grad_fn=) [2024-06-18 23:02:43,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2700.92 | bwd_microstep: 1726.66 | bwd_inner_microstep: 1721.67 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9776, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(0.9380, device='cuda:0', grad_fn=) [2024-06-18 23:02:49,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:02:49,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 3556.54 | bwd_microstep: 1924.05 | bwd_inner_microstep: 1918.53 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.31 [2024-06-18 23:02:49,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6257.42 | bwd: 3650.71 | bwd_inner: 3640.21 | bwd_allreduce: 10.30 | step: 61.39 44%|████▎ | 262/600 [47:42<57:02, 10.12s/it] {'loss': 0.8806, 'learning_rate': 6.254948265321744e-05, 'epoch': 2.62} 44%|████▎ | 262/600 [47:42<57:02, 10.12s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8273, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(0.8027, device='cuda:0', grad_fn=) [2024-06-18 23:02:55,103] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.32 | bwd_microstep: 1965.23 | bwd_inner_microstep: 1960.11 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.4388, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4649, device='cuda:0', grad_fn=) [2024-06-18 23:02:59,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:02:59,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2844.52 | bwd_microstep: 1644.73 | bwd_inner_microstep: 1639.27 | bwd_allreduce_microstep: 
5.34 | step_microstep: 61.20 [2024-06-18 23:02:59,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6421.82 | bwd: 3609.96 | bwd_inner: 3599.41 | bwd_allreduce: 10.35 | step: 61.28 44%|████▍ | 263/600 [47:52<57:08, 10.17s/it] {'loss': 0.6338, 'learning_rate': 6.228804425738917e-05, 'epoch': 2.63} 44%|████▍ | 263/600 [47:52<57:08, 10.17s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2102, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.1592, device='cuda:0', grad_fn=) [2024-06-18 23:03:05,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3543.62 | bwd_microstep: 1885.24 | bwd_inner_microstep: 1880.13 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6933, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6824, device='cuda:0', grad_fn=) [2024-06-18 23:03:10,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:03:10,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.91 | bwd_microstep: 1935.68 | bwd_inner_microstep: 1930.00 | bwd_allreduce_microstep: 5.56 | step_microstep: 62.02 [2024-06-18 23:03:10,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 7112.52 | bwd: 3820.92 | bwd_inner: 3810.15 | bwd_allreduce: 10.57 | step: 62.10 44%|████▍ | 264/600 [48:04<58:41, 10.48s/it] {'loss': 0.9208, 'learning_rate': 6.202624781831268e-05, 'epoch': 2.64} 44%|████▍ | 264/600 [48:04<58:41, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0226, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1019, device='cuda:0', grad_fn=) [2024-06-18 23:03:15,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2646.54 | bwd_microstep: 1604.53 | bwd_inner_microstep: 1599.63 | bwd_allreduce_microstep: 4.80 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1874, device='cuda:0', grad_fn=) tensor(0.5814, device='cuda:0', grad_fn=) tensor(1.1268, device='cuda:0', grad_fn=) [2024-06-18 23:03:21,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 23:03:21,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3638.03 | bwd_microstep: 2102.33 | bwd_inner_microstep: 2096.62 | bwd_allreduce_microstep: 5.60 | step_microstep: 62.33 [2024-06-18 23:03:21,197] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6284.55 | bwd: 3706.86 | bwd_inner: 3696.26 | bwd_allreduce: 10.41 | step: 62.41 44%|████▍ | 265/600 
[48:14<58:08, 10.41s/it] {'loss': 0.6143, 'learning_rate': 6.176410096408938e-05, 'epoch': 2.65} 44%|████▍ | 265/600 [48:14<58:08, 10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3883, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4190, device='cuda:0', grad_fn=) [2024-06-18 23:03:26,848] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.13 | bwd_microstep: 1963.49 | bwd_inner_microstep: 1958.54 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9472, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9109, device='cuda:0', grad_fn=) [2024-06-18 23:03:32,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:03:32,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.43 | bwd_microstep: 1907.79 | bwd_inner_microstep: 1902.29 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.32 [2024-06-18 23:03:32,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7133.53 | bwd: 3871.27 | bwd_inner: 3860.85 | bwd_allreduce: 10.22 | step: 61.39 44%|████▍ | 266/600 [48:25<59:24, 10.67s/it] {'loss': 0.665, 'learning_rate': 6.150161133303089e-05, 'epoch': 2.66} 44%|████▍ 
| 266/600 [48:25<59:24, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0243, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1034, device='cuda:0', grad_fn=) [2024-06-18 23:03:37,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3492.55 | bwd_microstep: 1802.97 | bwd_inner_microstep: 1797.94 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0616, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1369, device='cuda:0', grad_fn=) [2024-06-18 23:03:42,262] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:03:42,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2650.93 | bwd_microstep: 1606.47 | bwd_inner_microstep: 1600.95 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.13 [2024-06-18 23:03:42,263] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6143.47 | bwd: 3409.44 | bwd_inner: 3398.97 | bwd_allreduce: 10.22 | step: 61.21 44%|████▍ | 267/600 [48:35<57:45, 10.41s/it] {'loss': 0.1202, 'learning_rate': 6.123878657343648e-05, 'epoch': 2.67} 44%|████▍ | 267/600 [48:35<57:45, 10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) 
at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6009, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6108, device='cuda:0', grad_fn=) [2024-06-18 23:03:47,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.65 | bwd_microstep: 1914.75 | bwd_inner_microstep: 1909.82 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9403, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9162, device='cuda:0', grad_fn=) [2024-06-18 23:03:53,437] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:03:53,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.70 | bwd_microstep: 1899.42 | bwd_inner_microstep: 1893.83 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.15 [2024-06-18 23:03:53,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7101.31 | bwd: 3814.17 | bwd_inner: 3803.68 | bwd_allreduce: 10.28 | step: 61.23 45%|████▍ | 268/600 [48:46<58:51, 10.64s/it] {'loss': 0.7635, 'learning_rate': 6.0975634343370256e-05, 'epoch': 2.68} 45%|████▍ | 268/600 [48:46<58:51, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7017, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7011, device='cuda:0', grad_fn=) [2024-06-18 23:03:58,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.26 | bwd_microstep: 1896.70 | bwd_inner_microstep: 1891.71 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0079, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0882, device='cuda:0', grad_fn=) [2024-06-18 23:04:04,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:04:04,442] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3501.88 | bwd_microstep: 1807.48 | bwd_inner_microstep: 1801.86 | bwd_allreduce_microstep: 5.51 | step_microstep: 61.27 [2024-06-18 23:04:04,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7048.10 | bwd: 3704.18 | bwd_inner: 3693.63 | bwd_allreduce: 10.30 | step: 61.35 45%|████▍ | 269/600 [48:57<59:17, 10.75s/it] {'loss': 0.3946, 'learning_rate': 6.071216231043799e-05, 'epoch': 2.69} 45%|████▍ | 269/600 [48:57<59:17, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor 
b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.0608, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0135, device='cuda:0', grad_fn=) [2024-06-18 23:04:09,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2760.62 | bwd_microstep: 1838.83 | bwd_inner_microstep: 1833.78 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8982, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8780, device='cuda:0', grad_fn=) [2024-06-18 23:04:14,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:04:14,841] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.09 | bwd_microstep: 1951.33 | bwd_inner_microstep: 1945.87 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.58 [2024-06-18 23:04:14,842] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6335.69 | bwd: 3790.15 | bwd_inner: 3779.69 | bwd_allreduce: 10.26 | step: 61.66 45%|████▌ | 270/600 [49:07<58:32, 10.64s/it] {'loss': 0.9457, 'learning_rate': 6.044837815156377e-05, 'epoch': 2.7} 45%|████▌ | 270/600 [49:07<58:32, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0873, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1600, device='cuda:0', grad_fn=) [2024-06-18 23:04:20,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3468.26 | bwd_microstep: 1724.73 | bwd_inner_microstep: 1719.63 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8546, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8391, device='cuda:0', grad_fn=) [2024-06-18 23:04:25,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:04:25,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.04 | bwd_microstep: 1899.87 | bwd_inner_microstep: 1894.22 | bwd_allreduce_microstep: 5.49 | step_microstep: 61.06 [2024-06-18 23:04:25,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7021.28 | bwd: 3624.60 | bwd_inner: 3613.89 | bwd_allreduce: 10.47 | step: 61.20 45%|████▌ | 271/600 [49:18<58:46, 10.72s/it] {'loss': 0.4996, 'learning_rate': 6.018428955276617e-05, 'epoch': 2.71} 45%|████▌ | 271/600 [49:18<58:46, 10.72s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0269, device='cuda:0', grad_fn=) tensor(0.8107, 
device='cuda:0', grad_fn=) tensor(0.1053, device='cuda:0', grad_fn=) [2024-06-18 23:04:31,141] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.59 | bwd_microstep: 1809.40 | bwd_inner_microstep: 1804.40 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0046, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0856, device='cuda:0', grad_fn=) [2024-06-18 23:04:36,517] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:04:36,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.91 | bwd_microstep: 1747.09 | bwd_inner_microstep: 1741.52 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.25 [2024-06-18 23:04:36,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6974.46 | bwd: 3556.49 | bwd_inner: 3546.01 | bwd_allreduce: 10.20 | step: 61.34 45%|████▌ | 272/600 [49:29<58:41, 10.74s/it] {'loss': 0.0955, 'learning_rate': 5.99199042089345e-05, 'epoch': 2.72} 45%|████▌ | 272/600 [49:29<58:41, 10.74s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0431, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.0091, device='cuda:0', grad_fn=) [2024-06-18 23:04:42,210] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3595.90 | bwd_microstep: 1985.39 | bwd_inner_microstep: 1980.45 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1043, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(1.0524, device='cuda:0', grad_fn=) [2024-06-18 23:04:47,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:04:47,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2784.69 | bwd_microstep: 1876.50 | bwd_inner_microstep: 1871.05 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.35 [2024-06-18 23:04:47,039] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6380.57 | bwd: 3861.89 | bwd_inner: 3851.52 | bwd_allreduce: 10.17 | step: 61.43 46%|████▌ | 273/600 [49:40<58:09, 10.67s/it] {'loss': 1.0307, 'learning_rate': 5.9655229823604406e-05, 'epoch': 2.73} 46%|████▌ | 273/600 [49:40<58:09, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0250, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1036, device='cuda:0', grad_fn=) [2024-06-18 23:04:52,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.83 | bwd_microstep: 1725.46 | 
bwd_inner_microstep: 1720.31 | bwd_allreduce_microstep: 5.02 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.6898, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6903, device='cuda:0', grad_fn=) [2024-06-18 23:04:57,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:04:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2940.60 | bwd_microstep: 1824.68 | bwd_inner_microstep: 1819.24 | bwd_allreduce_microstep: 5.33 | step_microstep: 62.08 [2024-06-18 23:04:57,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6416.42 | bwd: 3550.13 | bwd_inner: 3539.57 | bwd_allreduce: 10.36 | step: 62.17 46%|████▌ | 274/600 [49:50<57:14, 10.53s/it] {'loss': 0.397, 'learning_rate': 5.939027410873351e-05, 'epoch': 2.74} 46%|████▌ | 274/600 [49:50<57:14, 10.53s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1442, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0998, device='cuda:0', grad_fn=) [2024-06-18 23:05:01,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2746.63 | bwd_microstep: 1805.62 | bwd_inner_microstep: 1800.70 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.08 warning: The size of tensor 
a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8421, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8274, device='cuda:0', grad_fn=) [2024-06-18 23:05:06,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:05:06,866] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2947.29 | bwd_microstep: 1854.81 | bwd_inner_microstep: 1849.16 | bwd_allreduce_microstep: 5.51 | step_microstep: 62.22 [2024-06-18 23:05:06,867] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5693.90 | bwd: 3660.42 | bwd_inner: 3649.93 | bwd_allreduce: 10.30 | step: 62.31 46%|████▌ | 275/600 [50:00<55:34, 10.26s/it] {'loss': 0.9636, 'learning_rate': 5.912504478447669e-05, 'epoch': 2.75} 46%|████▌ | 275/600 [50:00<55:34, 10.26s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8796, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8619, device='cuda:0', grad_fn=) [2024-06-18 23:05:12,092] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3246.10 | bwd_microstep: 1876.44 | bwd_inner_microstep: 1871.48 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0053, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0863, device='cuda:0', grad_fn=) [2024-06-18 23:05:17,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:05:17,290] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3411.48 | bwd_microstep: 1640.08 | bwd_inner_microstep: 1634.32 | bwd_allreduce_microstep: 5.65 | step_microstep: 61.33 [2024-06-18 23:05:17,291] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6657.56 | bwd: 3516.51 | bwd_inner: 3505.81 | bwd_allreduce: 10.50 | step: 61.41 46%|████▌ | 276/600 [50:10<55:39, 10.31s/it] {'loss': 0.4741, 'learning_rate': 5.885954957896115e-05, 'epoch': 2.76} 46%|████▌ | 276/600 [50:10<55:39, 10.31s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9242, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8909, device='cuda:0', grad_fn=) [2024-06-18 23:05:22,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.29 | bwd_microstep: 1933.37 | bwd_inner_microstep: 1928.43 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8868, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8685, device='cuda:0', grad_fn=) [2024-06-18 23:05:28,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:05:28,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.60 | bwd_microstep: 1953.80 | bwd_inner_microstep: 1948.08 | bwd_allreduce_microstep: 5.61 | step_microstep: 62.85 [2024-06-18 23:05:28,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.85 | bwd: 3887.16 | bwd_inner: 3876.52 | bwd_allreduce: 10.45 | step: 62.94 46%|████▌ | 277/600 [50:21<57:04, 10.60s/it] {'loss': 0.8797, 'learning_rate': 5.859379622806115e-05, 'epoch': 2.77} 46%|████▌ | 277/600 [50:21<57:04, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.6492, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6538, device='cuda:0', grad_fn=) [2024-06-18 23:05:33,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2742.84 | bwd_microstep: 1802.34 | bwd_inner_microstep: 1797.36 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8917, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8616, device='cuda:0', grad_fn=) [2024-06-18 23:05:38,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:05:38,888] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.39 | bwd_microstep: 1937.52 | bwd_inner_microstep: 1931.93 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.20 [2024-06-18 23:05:38,889] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6303.21 | bwd: 3739.85 | bwd_inner: 3729.39 | bwd_allreduce: 10.18 | step: 61.28 46%|████▋ | 278/600 [50:32<56:25, 10.51s/it] {'loss': 0.7577, 'learning_rate': 5.832779247517273e-05, 'epoch': 2.78} 46%|████▋ | 278/600 [50:32<56:25, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6757, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6785, device='cuda:0', grad_fn=) [2024-06-18 23:05:44,546] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.31 | bwd_microstep: 1965.40 | bwd_inner_microstep: 1960.18 | bwd_allreduce_microstep: 5.10 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.8262, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8024, device='cuda:0', grad_fn=) [2024-06-18 23:05:50,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:05:50,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.76 | bwd_microstep: 1932.53 | bwd_inner_microstep: 1927.11 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.63 [2024-06-18 23:05:50,206] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7148.02 | bwd: 3897.93 | bwd_inner: 3887.29 | bwd_allreduce: 10.45 | step: 61.71 46%|████▋ | 279/600 [50:43<57:32, 10.76s/it] {'loss': 0.7404, 'learning_rate': 5.8061546070987994e-05, 'epoch': 2.79} 46%|████▋ | 279/600 [50:43<57:32, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0910, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.0522, device='cuda:0', grad_fn=) [2024-06-18 23:05:55,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.49 | bwd_microstep: 1925.53 | bwd_inner_microstep: 1920.40 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7549, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7382, 
device='cuda:0', grad_fn=) [2024-06-18 23:06:01,463] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:06:01,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.06 | bwd_microstep: 1934.77 | bwd_inner_microstep: 1929.24 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.17 [2024-06-18 23:06:01,464] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.53 | bwd: 3860.29 | bwd_inner: 3849.72 | bwd_allreduce: 10.33 | step: 61.25 47%|████▋ | 280/600 [50:54<58:09, 10.91s/it] {'loss': 0.8952, 'learning_rate': 5.779506477326933e-05, 'epoch': 2.8} 47%|████▋ | 280/600 [50:54<58:09, 10.91s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7925, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7828, device='cuda:0', grad_fn=) [2024-06-18 23:06:07,109] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.79 | bwd_microstep: 1960.46 | bwd_inner_microstep: 1955.51 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6553, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6483, device='cuda:0', grad_fn=) [2024-06-18 23:06:12,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 1.85 [2024-06-18 23:06:12,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.66 | bwd_microstep: 1930.55 | bwd_inner_microstep: 1924.99 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.99 [2024-06-18 23:06:12,763] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7138.41 | bwd: 3891.01 | bwd_inner: 3880.55 | bwd_allreduce: 10.27 | step: 61.07 47%|████▋ | 281/600 [51:05<58:36, 11.02s/it] {'loss': 0.7155, 'learning_rate': 5.752835634662331e-05, 'epoch': 2.81} 47%|████▋ | 281/600 [51:05<58:36, 11.02s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0023, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.9724, device='cuda:0', grad_fn=) [2024-06-18 23:06:18,333] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.12 | bwd_microstep: 1914.85 | bwd_inner_microstep: 1909.79 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9269, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8933, device='cuda:0', grad_fn=) [2024-06-18 23:06:23,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:06:23,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
fwd_microstep: 3257.52 | bwd_microstep: 1901.38 | bwd_inner_microstep: 1895.83 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.30 [2024-06-18 23:06:23,659] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6810.61 | bwd: 3816.23 | bwd_inner: 3805.64 | bwd_allreduce: 10.39 | step: 61.38 47%|████▋ | 282/600 [51:16<58:13, 10.99s/it] {'loss': 0.9328, 'learning_rate': 5.726142856227452e-05, 'epoch': 2.82} 47%|████▋ | 282/600 [51:16<58:13, 10.99s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8933, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8740, device='cuda:0', grad_fn=) [2024-06-18 23:06:29,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.48 | bwd_microstep: 1907.61 | bwd_inner_microstep: 1902.54 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0079, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9656, device='cuda:0', grad_fn=) [2024-06-18 23:06:34,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:06:34,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3598.50 | bwd_microstep: 1986.72 | bwd_inner_microstep: 1981.07 | bwd_allreduce_microstep: 
5.53 | step_microstep: 61.58 [2024-06-18 23:06:34,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7144.97 | bwd: 3894.32 | bwd_inner: 3883.66 | bwd_allreduce: 10.40 | step: 61.66 47%|████▋ | 283/600 [51:28<58:33, 11.08s/it] {'loss': 0.9198, 'learning_rate': 5.699428919783906e-05, 'epoch': 2.83} 47%|████▋ | 283/600 [51:28<58:33, 11.08s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0580, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.0225, device='cuda:0', grad_fn=) [2024-06-18 23:06:40,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.01 | bwd_microstep: 1959.83 | bwd_inner_microstep: 1954.76 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8036, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7821, device='cuda:0', grad_fn=) [2024-06-18 23:06:45,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:06:45,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3098.99 | bwd_microstep: 1883.47 | bwd_inner_microstep: 1877.95 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.54 [2024-06-18 23:06:45,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 6676.96 | bwd: 3843.29 | bwd_inner: 3832.78 | bwd_allreduce: 10.28 | step: 61.63 47%|████▋ | 284/600 [51:38<57:54, 11.00s/it] {'loss': 0.9023, 'learning_rate': 5.672694603709794e-05, 'epoch': 2.84} 47%|████▋ | 284/600 [51:38<57:54, 11.00s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9136, device='cuda:0', grad_fn=) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.8919, device='cuda:0', grad_fn=) [2024-06-18 23:06:51,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3572.80 | bwd_microstep: 1956.78 | bwd_inner_microstep: 1951.89 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0125, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0924, device='cuda:0', grad_fn=) [2024-06-18 23:06:56,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:06:56,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2868.56 | bwd_microstep: 1678.88 | bwd_inner_microstep: 1673.41 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.24 [2024-06-18 23:06:56,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6441.32 | bwd: 3635.66 | bwd_inner: 3625.29 | bwd_allreduce: 10.19 | step: 61.32 48%|████▊ | 285/600 
[51:49<56:41, 10.80s/it] {'loss': 0.4921, 'learning_rate': 5.645940686977033e-05, 'epoch': 2.85} 48%|████▊ | 285/600 [51:49<56:41, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9782, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9388, device='cuda:0', grad_fn=) [2024-06-18 23:07:01,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.65 | bwd_microstep: 1906.46 | bwd_inner_microstep: 1901.30 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0020, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0829, device='cuda:0', grad_fn=) [2024-06-18 23:07:07,017] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:07:07,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3473.08 | bwd_microstep: 1738.86 | bwd_inner_microstep: 1733.40 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.14 [2024-06-18 23:07:07,018] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7025.71 | bwd: 3645.31 | bwd_inner: 3634.70 | bwd_allreduce: 10.46 | step: 61.24 48%|████▊ | 286/600 [52:00<56:42, 10.83s/it] {'loss': 0.5109, 'learning_rate': 5.619167949128652e-05, 'epoch': 2.86} 48%|████▊ 
| 286/600 [52:00<56:42, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9104, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8893, device='cuda:0', grad_fn=) [2024-06-18 23:07:11,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3016.82 | bwd_microstep: 1703.34 | bwd_inner_microstep: 1698.31 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3696, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.4137, device='cuda:0', grad_fn=) [2024-06-18 23:07:17,214] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:07:17,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3479.68 | bwd_microstep: 1745.98 | bwd_inner_microstep: 1740.49 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.37 [2024-06-18 23:07:17,215] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6496.48 | bwd: 3449.31 | bwd_inner: 3438.86 | bwd_allreduce: 10.25 | step: 61.46 48%|████▊ | 287/600 [52:10<55:31, 10.64s/it] {'loss': 0.6515, 'learning_rate': 5.59237717025608e-05, 'epoch': 2.87} 48%|████▊ | 287/600 [52:10<55:31, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) 
at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9872, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.9584, device='cuda:0', grad_fn=) [2024-06-18 23:07:22,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.01 | bwd_microstep: 1895.03 | bwd_inner_microstep: 1889.81 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8058, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7955, device='cuda:0', grad_fn=) [2024-06-18 23:07:28,363] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:07:28,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.22 | bwd_microstep: 1895.81 | bwd_inner_microstep: 1890.28 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.54 [2024-06-18 23:07:28,365] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7100.20 | bwd: 3790.83 | bwd_inner: 3780.16 | bwd_allreduce: 10.45 | step: 61.65 48%|████▊ | 288/600 [52:21<56:08, 10.80s/it] {'loss': 0.8769, 'learning_rate': 5.565569130976422e-05, 'epoch': 2.88} 48%|████▊ | 288/600 [52:21<56:08, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5960, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5949, device='cuda:0', grad_fn=) [2024-06-18 23:07:33,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.97 | bwd_microstep: 1902.07 | bwd_inner_microstep: 1897.04 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8697, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8412, device='cuda:0', grad_fn=) [2024-06-18 23:07:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:07:39,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.90 | bwd_microstep: 1935.55 | bwd_inner_microstep: 1930.05 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.17 [2024-06-18 23:07:39,583] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7114.86 | bwd: 3837.61 | bwd_inner: 3827.11 | bwd_allreduce: 10.31 | step: 61.25 48%|████▊ | 289/600 [52:32<56:36, 10.92s/it] {'loss': 0.718, 'learning_rate': 5.538744612409701e-05, 'epoch': 2.89} 48%|████▊ | 289/600 [52:32<56:36, 10.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0453, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0107, device='cuda:0', grad_fn=) [2024-06-18 23:07:45,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.22 | bwd_microstep: 1897.90 | bwd_inner_microstep: 1892.71 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8549, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8285, device='cuda:0', grad_fn=) [2024-06-18 23:07:50,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:07:50,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3571.53 | bwd_microstep: 1935.68 | bwd_inner_microstep: 1930.09 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.78 [2024-06-18 23:07:50,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7121.73 | bwd: 3833.58 | bwd_inner: 3822.86 | bwd_allreduce: 10.47 | step: 61.93 48%|████▊ | 290/600 [52:43<56:54, 11.01s/it] {'loss': 0.9196, 'learning_rate': 5.5119043961561136e-05, 'epoch': 2.9} 48%|████▊ | 290/600 [52:43<56:54, 11.01s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1908, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.2532, device='cuda:0', grad_fn=) [2024-06-18 23:07:55,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3425.07 | bwd_microstep: 1639.08 | bwd_inner_microstep: 1633.93 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0696, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1437, device='cuda:0', grad_fn=) [2024-06-18 23:08:00,385] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:08:00,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2662.55 | bwd_microstep: 1612.67 | bwd_inner_microstep: 1606.74 | bwd_allreduce_microstep: 5.81 | step_microstep: 62.32 [2024-06-18 23:08:00,386] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6087.58 | bwd: 3251.75 | bwd_inner: 3240.72 | bwd_allreduce: 10.82 | step: 62.40 48%|████▊ | 291/600 [52:53<54:30, 10.58s/it] {'loss': 0.1985, 'learning_rate': 5.4850492642732406e-05, 'epoch': 2.91} 48%|████▊ | 291/600 [52:53<54:30, 10.58s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1801, device='cuda:0', grad_fn=) tensor(0.5847, 
device='cuda:0', grad_fn=) tensor(1.1206, device='cuda:0', grad_fn=) [2024-06-18 23:08:05,279] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2953.42 | bwd_microstep: 1834.03 | bwd_inner_microstep: 1829.03 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9106, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8780, device='cuda:0', grad_fn=) [2024-06-18 23:08:10,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:08:10,952] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3571.52 | bwd_microstep: 1936.55 | bwd_inner_microstep: 1930.95 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.67 [2024-06-18 23:08:10,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6524.92 | bwd: 3770.57 | bwd_inner: 3760.05 | bwd_allreduce: 10.33 | step: 61.76 49%|████▊ | 292/600 [53:04<54:17, 10.58s/it] {'loss': 0.9993, 'learning_rate': 5.458179999253275e-05, 'epoch': 2.92} 49%|████▊ | 292/600 [53:04<54:17, 10.58s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6559, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6603, device='cuda:0', grad_fn=) [2024-06-18 23:08:16,541] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.48 | bwd_microstep: 1909.40 | bwd_inner_microstep: 1904.11 | bwd_allreduce_microstep: 5.18 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9420, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9178, device='cuda:0', grad_fn=) [2024-06-18 23:08:22,134] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.00 [2024-06-18 23:08:22,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.43 | bwd_microstep: 1886.85 | bwd_inner_microstep: 1881.16 | bwd_allreduce_microstep: 5.58 | step_microstep: 63.21 [2024-06-18 23:08:22,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.90 | bwd: 3796.25 | bwd_inner: 3785.28 | bwd_allreduce: 10.76 | step: 63.29 49%|████▉ | 293/600 [53:15<55:03, 10.76s/it] {'loss': 0.789, 'learning_rate': 5.4312973840002045e-05, 'epoch': 2.93} 49%|████▉ | 293/600 [53:15<55:03, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0882, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1608, device='cuda:0', grad_fn=) [2024-06-18 23:08:27,460] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3487.07 | bwd_microstep: 1743.73 | 
bwd_inner_microstep: 1738.60 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2932, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3339, device='cuda:0', grad_fn=) [2024-06-18 23:08:33,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:08:33,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.21 | bwd_microstep: 1923.48 | bwd_inner_microstep: 1918.04 | bwd_allreduce_microstep: 5.30 | step_microstep: 61.96 [2024-06-18 23:08:33,120] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7060.26 | bwd: 3667.20 | bwd_inner: 3656.70 | bwd_allreduce: 10.25 | step: 62.04 49%|████▉ | 294/600 [53:26<55:13, 10.83s/it] {'loss': 0.2474, 'learning_rate': 5.4044022018070214e-05, 'epoch': 2.94} 49%|████▉ | 294/600 [53:26<55:13, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0466, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1234, device='cuda:0', grad_fn=) [2024-06-18 23:08:37,565] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2692.48 | bwd_microstep: 1657.62 | bwd_inner_microstep: 1652.44 | bwd_allreduce_microstep: 4.97 | step_microstep: 0.08 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0580, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1333, device='cuda:0', grad_fn=) [2024-06-18 23:08:42,916] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:08:42,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3474.54 | bwd_microstep: 1727.73 | bwd_inner_microstep: 1722.13 | bwd_allreduce_microstep: 5.49 | step_microstep: 62.98 [2024-06-18 23:08:42,917] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6166.99 | bwd: 3385.34 | bwd_inner: 3374.65 | bwd_allreduce: 10.45 | step: 63.06 49%|████▉ | 295/600 [53:36<53:27, 10.52s/it] {'loss': 0.1283, 'learning_rate': 5.37749523633288e-05, 'epoch': 2.95} 49%|████▉ | 295/600 [53:36<53:27, 10.52s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2200, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.2795, device='cuda:0', grad_fn=) [2024-06-18 23:08:48,315] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3504.16 | bwd_microstep: 1801.32 | bwd_inner_microstep: 1796.19 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0587, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1343, device='cuda:0', grad_fn=) [2024-06-18 23:08:53,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:08:53,608] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3451.93 | bwd_microstep: 1693.34 | bwd_inner_microstep: 1687.56 | bwd_allreduce_microstep: 5.59 | step_microstep: 62.78 [2024-06-18 23:08:53,609] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6956.06 | bwd: 3494.65 | bwd_inner: 3483.84 | bwd_allreduce: 10.55 | step: 62.86 49%|████▉ | 296/600 [53:46<53:33, 10.57s/it] {'loss': 0.2069, 'learning_rate': 5.3505772715802704e-05, 'epoch': 2.96} 49%|████▉ | 296/600 [53:46<53:33, 10.57s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7915, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7819, device='cuda:0', grad_fn=) [2024-06-18 23:08:59,170] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.05 | bwd_microstep: 1898.22 | bwd_inner_microstep: 1893.13 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6300, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6366, device='cuda:0', grad_fn=) [2024-06-18 23:09:04,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:09:04,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3247.81 | bwd_microstep: 1853.54 | bwd_inner_microstep: 1848.03 | bwd_allreduce_microstep: 5.39 | step_microstep: 62.06 [2024-06-18 23:09:04,429] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6806.83 | bwd: 3751.75 | bwd_inner: 3741.21 | bwd_allreduce: 10.34 | step: 62.15 50%|████▉ | 297/600 [53:57<53:45, 10.64s/it] {'loss': 0.7092, 'learning_rate': 5.3236490918721794e-05, 'epoch': 2.97} 50%|████▉ | 297/600 [53:57<53:45, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0035, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0842, device='cuda:0', grad_fn=) [2024-06-18 23:09:09,837] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3512.76 | bwd_microstep: 1801.80 | bwd_inner_microstep: 1796.59 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8895, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8705, device='cuda:0', grad_fn=) [2024-06-18 23:09:15,498] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.00 [2024-06-18 23:09:15,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3572.55 | bwd_microstep: 1926.52 | bwd_inner_microstep: 1920.86 | bwd_allreduce_microstep: 5.54 | step_microstep: 62.07 [2024-06-18 23:09:15,499] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7085.28 | bwd: 3728.32 | bwd_inner: 3717.48 | bwd_allreduce: 10.63 | step: 62.15 50%|████▉ | 298/600 [54:08<54:13, 10.77s/it] {'loss': 0.4774, 'learning_rate': 5.296711481829226e-05, 'epoch': 2.98} 50%|████▉ | 298/600 [54:08<54:13, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6671, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6703, device='cuda:0', grad_fn=) [2024-06-18 23:09:21,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.46 | bwd_microstep: 1891.31 | bwd_inner_microstep: 1886.04 | bwd_allreduce_microstep: 5.15 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.6075, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6171, device='cuda:0', grad_fn=) [2024-06-18 23:09:26,707] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:09:26,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3568.45 | bwd_microstep: 1926.79 | bwd_inner_microstep: 1921.23 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.85 [2024-06-18 23:09:26,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.86 | bwd: 3818.09 | bwd_inner: 3807.31 | bwd_allreduce: 10.59 | step: 62.00 50%|████▉ | 299/600 [54:19<54:41, 10.90s/it] {'loss': 0.6437, 'learning_rate': 5.2697652263468125e-05, 'epoch': 2.99} 50%|████▉ | 299/600 [54:19<54:41, 10.90s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7279, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7255, device='cuda:0', grad_fn=) [2024-06-18 23:09:32,296] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3571.92 | bwd_microstep: 1914.71 | bwd_inner_microstep: 1909.67 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.08 please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9041, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8836, device='cuda:0', grad_fn=) [2024-06-18 23:09:37,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:09:37,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2764.52 | bwd_microstep: 1802.77 | bwd_inner_microstep: 1797.10 | bwd_allreduce_microstep: 5.50 | step_microstep: 61.92 [2024-06-18 23:09:37,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6336.39 | bwd: 3717.47 | bwd_inner: 3706.82 | bwd_allreduce: 10.43 | step: 62.00 50%|█████ | 300/600 [54:30<54:51, 10.97s/it] {'loss': 0.8045, 'learning_rate': 5.242811110572242e-05, 'epoch': 3.0} 50%|█████ | 300/600 [54:30<54:51, 10.97s/it][2024-06-18 23:09:40,544] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:09:46,388] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:09:52,202] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:09:58,030] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4951, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5151, device='cuda:0', grad_fn=) [2024-06-18 23:10:07,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3584.75 | bwd_microstep: 1960.42 | bwd_inner_microstep: 1955.28 | bwd_allreduce_microstep: 5.03 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9057, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8739, device='cuda:0', grad_fn=) [2024-06-18 23:10:12,970] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:10:12,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.28 | bwd_microstep: 1970.54 | bwd_inner_microstep: 1965.06 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.10 [2024-06-18 23:10:12,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7163.99 | bwd: 3930.95 | bwd_inner: 3920.35 | bwd_allreduce: 10.40 | step: 61.18 50%|█████ | 301/600 [55:06<1:30:47, 18.22s/it] {'loss': 0.6945, 'learning_rate': 5.2158499198818503e-05, 'epoch': 3.01} 50%|█████ | 301/600 [55:06<1:30:47, 18.22s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5623, device='cuda:0', grad_fn=) tensor(0.6956, 
device='cuda:0', grad_fn=) tensor(0.5757, device='cuda:0', grad_fn=) [2024-06-18 23:10:18,549] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.10 | bwd_microstep: 1921.91 | bwd_inner_microstep: 1916.97 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.9236, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9013, device='cuda:0', grad_fn=) [2024-06-18 23:10:23,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:10:23,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2755.06 | bwd_microstep: 1807.76 | bwd_inner_microstep: 1802.04 | bwd_allreduce_microstep: 5.61 | step_microstep: 61.72 [2024-06-18 23:10:23,271] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6309.14 | bwd: 3729.66 | bwd_inner: 3719.05 | bwd_allreduce: 10.42 | step: 61.80 50%|█████ | 302/600 [55:16<1:18:41, 15.84s/it] {'loss': 0.7385, 'learning_rate': 5.188882439858117e-05, 'epoch': 3.02} 50%|█████ | 302/600 [55:16<1:18:41, 15.84s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8336, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8202, device='cuda:0', grad_fn=) [2024-06-18 23:10:28,132] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2939.41 | bwd_microstep: 1821.67 | bwd_inner_microstep: 1816.74 | bwd_allreduce_microstep: 4.74 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0091, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0897, device='cuda:0', grad_fn=) [2024-06-18 23:10:33,493] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:10:33,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3472.97 | bwd_microstep: 1737.41 | bwd_inner_microstep: 1731.79 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.52 [2024-06-18 23:10:33,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6412.36 | bwd: 3559.08 | bwd_inner: 3548.61 | bwd_allreduce: 10.16 | step: 61.59 50%|█████ | 303/600 [55:26<1:10:04, 14.16s/it] {'loss': 0.455, 'learning_rate': 5.1619094562667804e-05, 'epoch': 3.03} 50%|█████ | 303/600 [55:26<1:10:04, 14.16s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6279, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6351, device='cuda:0', grad_fn=) [2024-06-18 23:10:39,020] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.40 | bwd_microstep: 1884.04 | 
bwd_inner_microstep: 1879.06 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5449, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5600, device='cuda:0', grad_fn=) [2024-06-18 23:10:44,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:10:44,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.23 | bwd_microstep: 1892.16 | bwd_inner_microstep: 1886.70 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.35 [2024-06-18 23:10:44,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7090.61 | bwd: 3776.20 | bwd_inner: 3765.83 | bwd_allreduce: 10.17 | step: 61.43 51%|█████ | 304/600 [55:37<1:05:21, 13.25s/it] {'loss': 0.5975, 'learning_rate': 5.134931755033936e-05, 'epoch': 3.04} 51%|█████ | 304/600 [55:37<1:05:21, 13.25s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0142, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0943, device='cuda:0', grad_fn=) [2024-06-18 23:10:49,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3446.06 | bwd_microstep: 1692.11 | bwd_inner_microstep: 1687.03 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size 
of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0444, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1215, device='cuda:0', grad_fn=) [2024-06-18 23:10:54,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:10:54,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2669.64 | bwd_microstep: 1640.29 | bwd_inner_microstep: 1634.79 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.16 [2024-06-18 23:10:54,301] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6115.68 | bwd: 3332.40 | bwd_inner: 3321.89 | bwd_allreduce: 10.25 | step: 61.24 51%|█████ | 305/600 [55:47<59:52, 12.18s/it] {'loss': 0.1079, 'learning_rate': 5.107950122223139e-05, 'epoch': 3.05} 51%|█████ | 305/600 [55:47<59:52, 12.18s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0371, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0034, device='cuda:0', grad_fn=) [2024-06-18 23:10:59,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.72 | bwd_microstep: 1961.94 | bwd_inner_microstep: 1957.03 | bwd_allreduce_microstep: 4.80 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7930, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7725, device='cuda:0', grad_fn=) [2024-06-18 23:11:05,505] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:11:05,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3523.23 | bwd_microstep: 1870.90 | bwd_inner_microstep: 1865.36 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.56 [2024-06-18 23:11:05,506] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7100.94 | bwd: 3832.84 | bwd_inner: 3822.47 | bwd_allreduce: 10.16 | step: 61.64 51%|█████ | 306/600 [55:58<58:14, 11.89s/it] {'loss': 0.888, 'learning_rate': 5.080965344012508e-05, 'epoch': 3.06} 51%|█████ | 306/600 [55:58<58:14, 11.89s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0275, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1063, device='cuda:0', grad_fn=) [2024-06-18 23:11:10,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3471.46 | bwd_microstep: 1725.54 | bwd_inner_microstep: 1720.54 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7165, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7033, device='cuda:0', grad_fn=) [2024-06-18 23:11:16,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:11:16,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.24 | bwd_microstep: 1908.57 | bwd_inner_microstep: 1902.86 | bwd_allreduce_microstep: 5.57 | step_microstep: 61.61 [2024-06-18 23:11:16,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7027.68 | bwd: 3634.10 | bwd_inner: 3623.45 | bwd_allreduce: 10.45 | step: 61.68 51%|█████ | 307/600 [56:09<56:37, 11.60s/it] {'loss': 0.4048, 'learning_rate': 5.053978206671801e-05, 'epoch': 3.07} 51%|█████ | 307/600 [56:09<56:37, 11.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4191, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4472, device='cuda:0', grad_fn=) [2024-06-18 23:11:21,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.05 | bwd_microstep: 1886.63 | bwd_inner_microstep: 1881.57 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0553, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1312, device='cuda:0', grad_fn=) [2024-06-18 23:11:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:11:27,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3447.67 | bwd_microstep: 1693.62 | bwd_inner_microstep: 1688.14 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.26 [2024-06-18 23:11:27,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6997.70 | bwd: 3580.24 | bwd_inner: 3569.75 | bwd_allreduce: 10.29 | step: 61.34 51%|█████▏ | 308/600 [56:20<55:18, 11.36s/it] {'loss': 0.2892, 'learning_rate': 5.0269894965395225e-05, 'epoch': 3.08} 51%|█████▏ | 308/600 [56:20<55:18, 11.36s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9575, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.9316, device='cuda:0', grad_fn=) [2024-06-18 23:11:32,923] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3597.36 | bwd_microstep: 1970.80 | bwd_inner_microstep: 1965.81 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.5925, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5917, device='cuda:0', grad_fn=) [2024-06-18 23:11:38,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:11:38,579] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.99 | bwd_microstep: 1932.20 | bwd_inner_microstep: 1926.58 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.58 [2024-06-18 23:11:38,580] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7157.34 | bwd: 3903.00 | bwd_inner: 3892.47 | bwd_allreduce: 10.29 | step: 61.66 52%|█████▏ | 309/600 [56:31<55:04, 11.35s/it] {'loss': 0.7617, 'learning_rate': 5e-05, 'epoch': 3.09} 52%|█████▏ | 309/600 [56:31<55:04, 11.35s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3176, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3558, device='cuda:0', grad_fn=) [2024-06-18 23:11:44,163] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.81 | bwd_microstep: 1922.88 | bwd_inner_microstep: 1917.96 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6540, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6582, device='cuda:0', 
grad_fn=) [2024-06-18 23:11:49,769] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:11:49,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.79 | bwd_microstep: 1898.56 | bwd_inner_microstep: 1893.10 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.20 [2024-06-18 23:11:49,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7105.56 | bwd: 3821.44 | bwd_inner: 3811.07 | bwd_allreduce: 10.18 | step: 61.28 52%|█████▏ | 310/600 [56:42<54:38, 11.31s/it] {'loss': 0.507, 'learning_rate': 4.973010503460479e-05, 'epoch': 3.1} 52%|█████▏ | 310/600 [56:42<54:38, 11.31s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9621, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.9247, device='cuda:0', grad_fn=) [2024-06-18 23:11:55,366] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.03 | bwd_microstep: 1930.94 | bwd_inner_microstep: 1925.99 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8448, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8187, device='cuda:0', grad_fn=) [2024-06-18 23:12:01,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 
1.86 [2024-06-18 23:12:01,029] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.24 | bwd_microstep: 1935.99 | bwd_inner_microstep: 1930.20 | bwd_allreduce_microstep: 5.66 | step_microstep: 63.91 [2024-06-18 23:12:01,030] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7122.23 | bwd: 3866.93 | bwd_inner: 3856.22 | bwd_allreduce: 10.50 | step: 63.99 52%|█████▏ | 311/600 [56:54<54:23, 11.29s/it] {'loss': 0.8717, 'learning_rate': 4.946021793328201e-05, 'epoch': 3.11} 52%|█████▏ | 311/600 [56:54<54:23, 11.29s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.7923, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.7829, device='cuda:0', grad_fn=) [2024-06-18 23:12:03,655] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1705.49 | bwd_microstep: 849.66 | bwd_inner_microstep: 844.74 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.7326, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7294, device='cuda:0', grad_fn=) [2024-06-18 23:12:08,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:12:08,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2891.53 | 
bwd_microstep: 1753.64 | bwd_inner_microstep: 1748.17 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.45 [2024-06-18 23:12:08,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4597.01 | bwd: 2603.30 | bwd_inner: 2592.93 | bwd_allreduce: 10.18 | step: 61.53 52%|█████▏ | 312/600 [57:01<48:38, 10.13s/it] {'loss': 0.7561, 'learning_rate': 4.919034655987493e-05, 'epoch': 3.12} 52%|█████▏ | 312/600 [57:01<48:38, 10.13s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.1339, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.2016, device='cuda:0', grad_fn=) [2024-06-18 23:12:11,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1702.07 | bwd_microstep: 842.17 | bwd_inner_microstep: 837.18 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9944, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9534, device='cuda:0', grad_fn=) [2024-06-18 23:12:16,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:12:16,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.24 | bwd_microstep: 1973.58 | bwd_inner_microstep: 1968.08 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.28 
[2024-06-18 23:12:16,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5283.29 | bwd: 2815.74 | bwd_inner: 2805.31 | bwd_allreduce: 10.21 | step: 61.36 52%|█████▏ | 313/600 [57:09<45:52, 9.59s/it] {'loss': 0.5775, 'learning_rate': 4.892049877776861e-05, 'epoch': 3.13} 52%|█████▏ | 313/600 [57:09<45:52, 9.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7312, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7281, device='cuda:0', grad_fn=) [2024-06-18 23:12:22,356] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.25 | bwd_microstep: 1913.02 | bwd_inner_microstep: 1907.83 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0011, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0816, device='cuda:0', grad_fn=) [2024-06-18 23:12:27,820] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:12:27,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3498.26 | bwd_microstep: 1813.08 | bwd_inner_microstep: 1807.53 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.89 [2024-06-18 23:12:27,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.47 | bwd: 3726.09 
| bwd_inner: 3715.45 | bwd_allreduce: 10.34 | step: 61.03 52%|█████▏ | 314/600 [57:20<47:46, 10.02s/it] {'loss': 0.4048, 'learning_rate': 4.865068244966066e-05, 'epoch': 3.14} 52%|█████▏ | 314/600 [57:20<47:46, 10.02s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7122, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6994, device='cuda:0', grad_fn=) [2024-06-18 23:12:33,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.52 | bwd_microstep: 1936.42 | bwd_inner_microstep: 1931.47 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8709, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8426, device='cuda:0', grad_fn=) [2024-06-18 23:12:39,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:12:39,053] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.29 | bwd_microstep: 1905.85 | bwd_inner_microstep: 1899.99 | bwd_allreduce_microstep: 5.72 | step_microstep: 61.76 [2024-06-18 23:12:39,054] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7120.80 | bwd: 3842.26 | bwd_inner: 3831.47 | bwd_allreduce: 10.56 | step: 61.84 52%|█████▎ | 315/600 [57:32<49:20, 10.39s/it] 
{'loss': 0.771, 'learning_rate': 4.838090543733222e-05, 'epoch': 3.15} 52%|█████▎ | 315/600 [57:32<49:20, 10.39s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8601, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8326, device='cuda:0', grad_fn=) [2024-06-18 23:12:44,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.97 | bwd_microstep: 1965.95 | bwd_inner_microstep: 1960.96 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0024, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0832, device='cuda:0', grad_fn=) [2024-06-18 23:12:50,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:12:50,070] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.68 | bwd_microstep: 1739.49 | bwd_inner_microstep: 1734.11 | bwd_allreduce_microstep: 5.22 | step_microstep: 61.17 [2024-06-18 23:12:50,071] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7053.63 | bwd: 3705.44 | bwd_inner: 3695.13 | bwd_allreduce: 10.10 | step: 61.25 53%|█████▎ | 316/600 [57:43<50:03, 10.58s/it] {'loss': 0.4579, 'learning_rate': 4.8111175601418844e-05, 'epoch': 3.16} 53%|█████▎ | 316/600 
[57:43<50:03, 10.58s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0772, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0390, device='cuda:0', grad_fn=) [2024-06-18 23:12:55,713] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.78 | bwd_microstep: 1956.68 | bwd_inner_microstep: 1951.52 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0046, device='cuda:0', grad_fn=) tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0860, device='cuda:0', grad_fn=) [2024-06-18 23:13:01,013] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.20 [2024-06-18 23:13:01,014] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3447.45 | bwd_microstep: 1694.82 | bwd_inner_microstep: 1689.19 | bwd_allreduce_microstep: 5.49 | step_microstep: 72.18 [2024-06-18 23:13:01,015] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7029.22 | bwd: 3651.50 | bwd_inner: 3640.78 | bwd_allreduce: 10.51 | step: 72.27 53%|█████▎ | 317/600 [57:54<50:24, 10.69s/it] {'loss': 0.5625, 'learning_rate': 4.784150080118152e-05, 'epoch': 3.17} 53%|█████▎ | 317/600 [57:54<50:24, 10.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0015, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0825, device='cuda:0', grad_fn=) [2024-06-18 23:13:06,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3477.67 | bwd_microstep: 1741.70 | bwd_inner_microstep: 1736.74 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8255, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8125, device='cuda:0', grad_fn=) [2024-06-18 23:13:11,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:13:11,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2782.73 | bwd_microstep: 1864.62 | bwd_inner_microstep: 1859.03 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.33 [2024-06-18 23:13:11,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6260.39 | bwd: 3606.31 | bwd_inner: 3595.83 | bwd_allreduce: 10.23 | step: 61.41 53%|█████▎ | 318/600 [58:04<49:25, 10.52s/it] {'loss': 0.4475, 'learning_rate': 4.7571888894277604e-05, 'epoch': 3.18} 53%|█████▎ | 318/600 [58:04<49:25, 10.52s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1333, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.2010, device='cuda:0', grad_fn=) [2024-06-18 23:13:16,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.27 | bwd_microstep: 1807.23 | bwd_inner_microstep: 1802.18 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(0.0114, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0918, device='cuda:0', grad_fn=) [2024-06-18 23:13:21,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:13:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3170.83 | bwd_microstep: 1712.80 | bwd_inner_microstep: 1707.30 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.91 [2024-06-18 23:13:21,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6666.06 | bwd: 3520.02 | bwd_inner: 3509.49 | bwd_allreduce: 10.34 | step: 61.99 53%|█████▎ | 319/600 [58:14<49:08, 10.49s/it] {'loss': 0.1464, 'learning_rate': 4.730234773653188e-05, 'epoch': 3.19} 53%|█████▎ | 319/600 [58:14<49:08, 10.49s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of 
tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0815, device='cuda:0', grad_fn=) [2024-06-18 23:13:26,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3478.16 | bwd_microstep: 1745.10 | bwd_inner_microstep: 1740.14 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7125, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7112, device='cuda:0', grad_fn=) [2024-06-18 23:13:32,481] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:13:32,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.64 | bwd_microstep: 1889.18 | bwd_inner_microstep: 1883.59 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.62 [2024-06-18 23:13:32,482] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7024.79 | bwd: 3634.27 | bwd_inner: 3623.76 | bwd_allreduce: 10.29 | step: 62.71 53%|█████▎ | 320/600 [58:25<49:32, 10.62s/it] {'loss': 0.3964, 'learning_rate': 4.703288518170774e-05, 'epoch': 3.2} 53%|█████▎ | 320/600 [58:25<49:32, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0032, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0840, device='cuda:0', grad_fn=) [2024-06-18 23:13:37,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.06 | bwd_microstep: 1805.45 | bwd_inner_microstep: 1800.60 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8201, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7966, device='cuda:0', grad_fn=) [2024-06-18 23:13:43,613] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:13:43,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.13 | bwd_microstep: 1977.97 | bwd_inner_microstep: 1972.33 | bwd_allreduce_microstep: 5.53 | step_microstep: 61.95 [2024-06-18 23:13:43,614] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7086.16 | bwd: 3783.42 | bwd_inner: 3772.96 | bwd_allreduce: 10.25 | step: 62.03 54%|█████▎ | 321/600 [58:36<50:05, 10.77s/it] {'loss': 0.4403, 'learning_rate': 4.676350908127822e-05, 'epoch': 3.21} 54%|█████▎ | 321/600 [58:36<50:05, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8711, device='cuda:0', grad_fn=) tensor(0.6956, 
device='cuda:0', grad_fn=) tensor(0.8536, device='cuda:0', grad_fn=) [2024-06-18 23:13:48,439] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3019.34 | bwd_microstep: 1706.32 | bwd_inner_microstep: 1701.25 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3276, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3648, device='cuda:0', grad_fn=) [2024-06-18 23:13:54,093] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:13:54,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.30 | bwd_microstep: 1927.62 | bwd_inner_microstep: 1922.08 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.52 [2024-06-18 23:13:54,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6583.62 | bwd: 3633.93 | bwd_inner: 3623.38 | bwd_allreduce: 10.31 | step: 61.60 54%|█████▎ | 322/600 [58:47<49:30, 10.68s/it] {'loss': 0.6092, 'learning_rate': 4.6494227284197294e-05, 'epoch': 3.22} 54%|█████▎ | 322/600 [58:47<49:30, 10.68s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0150, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0941, device='cuda:0', grad_fn=) [2024-06-18 23:13:59,378] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3467.34 | bwd_microstep: 1727.93 | bwd_inner_microstep: 1722.61 | bwd_allreduce_microstep: 5.19 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5569, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5600, device='cuda:0', grad_fn=) [2024-06-18 23:14:05,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:14:05,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.77 | bwd_microstep: 1970.24 | bwd_inner_microstep: 1964.64 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.57 [2024-06-18 23:14:05,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7048.08 | bwd: 3698.17 | bwd_inner: 3687.34 | bwd_allreduce: 10.60 | step: 61.71 54%|█████▍ | 323/600 [58:58<49:45, 10.78s/it] {'loss': 0.3271, 'learning_rate': 4.622504763667122e-05, 'epoch': 3.23} 54%|█████▍ | 323/600 [58:58<49:45, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7982, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7884, device='cuda:0', grad_fn=) [2024-06-18 23:14:10,629] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.43 | bwd_microstep: 1887.50 | 
bwd_inner_microstep: 1882.54 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4836, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5056, device='cuda:0', grad_fn=) [2024-06-18 23:14:16,265] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:14:16,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.99 | bwd_microstep: 1917.83 | bwd_inner_microstep: 1912.31 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.25 [2024-06-18 23:14:16,266] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7104.40 | bwd: 3805.33 | bwd_inner: 3794.89 | bwd_allreduce: 10.24 | step: 61.33 54%|█████▍ | 324/600 [59:09<50:07, 10.90s/it] {'loss': 0.647, 'learning_rate': 4.59559779819298e-05, 'epoch': 3.24} 54%|█████▍ | 324/600 [59:09<50:07, 10.90s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4517, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4764, device='cuda:0', grad_fn=) [2024-06-18 23:14:21,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.76 | bwd_microstep: 1925.10 | bwd_inner_microstep: 1919.97 | bwd_allreduce_microstep: 5.02 | step_microstep: 0.07 warning: The size of 
tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6987, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6876, device='cuda:0', grad_fn=) [2024-06-18 23:14:27,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:14:27,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.03 | bwd_microstep: 1935.15 | bwd_inner_microstep: 1929.65 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.15 [2024-06-18 23:14:27,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7123.76 | bwd: 3860.25 | bwd_inner: 3849.64 | bwd_allreduce: 10.43 | step: 61.23 54%|█████▍ | 325/600 [59:20<50:25, 11.00s/it] {'loss': 0.582, 'learning_rate': 4.568702615999797e-05, 'epoch': 3.25} 54%|█████▍ | 325/600 [59:20<50:25, 11.00s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.7170, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7149, device='cuda:0', grad_fn=) [2024-06-18 23:14:32,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2746.14 | bwd_microstep: 1804.53 | bwd_inner_microstep: 1799.59 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0138, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0931, device='cuda:0', grad_fn=) [2024-06-18 23:14:37,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:14:37,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3441.31 | bwd_microstep: 1692.61 | bwd_inner_microstep: 1686.68 | bwd_allreduce_microstep: 5.75 | step_microstep: 62.87 [2024-06-18 23:14:37,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6187.43 | bwd: 3497.14 | bwd_inner: 3486.34 | bwd_allreduce: 10.59 | step: 62.95 54%|█████▍ | 326/600 [59:30<48:46, 10.68s/it] {'loss': 0.404, 'learning_rate': 4.541820000746727e-05, 'epoch': 3.26} 54%|█████▍ | 326/600 [59:30<48:46, 10.68s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0282, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1069, device='cuda:0', grad_fn=) [2024-06-18 23:14:42,755] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3471.52 | bwd_microstep: 1738.64 | bwd_inner_microstep: 1733.68 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7089, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7083, device='cuda:0', grad_fn=) [2024-06-18 23:14:48,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 23:14:48,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.05 | bwd_microstep: 1954.79 | bwd_inner_microstep: 1949.23 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.38 [2024-06-18 23:14:48,451] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7049.56 | bwd: 3693.43 | bwd_inner: 3682.97 | bwd_allreduce: 10.22 | step: 61.46 55%|█████▍ | 327/600 [59:41<49:02, 10.78s/it] {'loss': 0.4076, 'learning_rate': 4.51495073572676e-05, 'epoch': 3.27} 55%|█████▍ | 327/600 [59:41<49:02, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2435, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2887, device='cuda:0', grad_fn=) [2024-06-18 23:14:54,102] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3582.68 | bwd_microstep: 1962.88 | bwd_inner_microstep: 1957.83 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7751, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7675, device='cuda:0', grad_fn=) [2024-06-18 23:14:59,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:14:59,757] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.29 | bwd_microstep: 1927.80 | bwd_inner_microstep: 1922.05 | bwd_allreduce_microstep: 5.63 | step_microstep: 62.96 [2024-06-18 23:14:59,758] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7143.94 | bwd: 3890.67 | bwd_inner: 3879.93 | bwd_allreduce: 10.55 | step: 63.10 55%|█████▍ | 328/600 [59:52<49:34, 10.94s/it] {'loss': 0.5281, 'learning_rate': 4.4880956038438876e-05, 'epoch': 3.28} 55%|█████▍ | 328/600 [59:52<49:34, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6173, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.6260, device='cuda:0', grad_fn=) [2024-06-18 23:15:05,405] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.19 | bwd_microstep: 1963.31 | bwd_inner_microstep: 1958.23 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.5580, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5722, device='cuda:0', grad_fn=) [2024-06-18 23:15:09,993] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:15:09,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2704.91 | bwd_microstep: 1727.17 | bwd_inner_microstep: 1721.71 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.54 [2024-06-18 23:15:09,994] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6284.07 | bwd: 3690.47 | bwd_inner: 3680.01 | bwd_allreduce: 10.21 | step: 61.62 55%|█████▍ | 329/600 [1:00:03<48:26, 10.73s/it] {'loss': 0.5991, 'learning_rate': 4.461255387590299e-05, 'epoch': 3.29} 55%|█████▍ | 329/600 [1:00:03<48:26, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6491, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6541, device='cuda:0', grad_fn=) [2024-06-18 23:15:15,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.77 | bwd_microstep: 1952.67 | bwd_inner_microstep: 1947.63 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4459, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4709, 
device='cuda:0', grad_fn=) [2024-06-18 23:15:21,218] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:15:21,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.54 | bwd_microstep: 1887.55 | bwd_inner_microstep: 1882.06 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.11 [2024-06-18 23:15:21,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.27 | bwd: 3840.22 | bwd_inner: 3829.74 | bwd_allreduce: 10.29 | step: 61.19 55%|█████▌ | 330/600 [1:00:14<48:56, 10.88s/it] {'loss': 0.5625, 'learning_rate': 4.434430869023579e-05, 'epoch': 3.3} 55%|█████▌ | 330/600 [1:00:14<48:56, 10.88s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0166, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0960, device='cuda:0', grad_fn=) [2024-06-18 23:15:26,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3500.66 | bwd_microstep: 1801.66 | bwd_inner_microstep: 1796.48 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8713, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8538, device='cuda:0', grad_fn=) [2024-06-18 23:15:32,306] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | optimizer_step: 1.91 [2024-06-18 23:15:32,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.32 | bwd_microstep: 1952.08 | bwd_inner_microstep: 1946.55 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.80 [2024-06-18 23:15:32,307] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7076.97 | bwd: 3753.73 | bwd_inner: 3743.05 | bwd_allreduce: 10.49 | step: 61.89 55%|█████▌ | 331/600 [1:00:25<49:02, 10.94s/it] {'loss': 0.4749, 'learning_rate': 4.4076228297439204e-05, 'epoch': 3.31} 55%|█████▌ | 331/600 [1:00:25<49:02, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0722, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(1.0345, device='cuda:0', grad_fn=) [2024-06-18 23:15:37,875] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.01 | bwd_microstep: 1913.55 | bwd_inner_microstep: 1908.45 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6507, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6444, device='cuda:0', grad_fn=) [2024-06-18 23:15:43,502] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:15:43,503] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 3554.81 | bwd_microstep: 1909.97 | bwd_inner_microstep: 1904.50 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.61 [2024-06-18 23:15:43,503] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7106.81 | bwd: 3823.51 | bwd_inner: 3812.99 | bwd_allreduce: 10.34 | step: 61.70 55%|█████▌ | 332/600 [1:00:36<49:12, 11.02s/it] {'loss': 0.8395, 'learning_rate': 4.38083205087135e-05, 'epoch': 3.32} 55%|█████▌ | 332/600 [1:00:36<49:12, 11.02s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0554, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1314, device='cuda:0', grad_fn=) [2024-06-18 23:15:47,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2673.87 | bwd_microstep: 1638.42 | bwd_inner_microstep: 1633.31 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0024, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0833, device='cuda:0', grad_fn=) [2024-06-18 23:15:53,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:15:53,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.83 | bwd_microstep: 1810.06 | bwd_inner_microstep: 1804.53 | 
bwd_allreduce_microstep: 5.35 | step_microstep: 61.60 [2024-06-18 23:15:53,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6171.67 | bwd: 3448.48 | bwd_inner: 3437.94 | bwd_allreduce: 10.22 | step: 61.68 56%|█████▌ | 333/600 [1:00:46<47:29, 10.67s/it] {'loss': 0.1073, 'learning_rate': 4.35405931302297e-05, 'epoch': 3.33} 56%|█████▌ | 333/600 [1:00:46<47:29, 10.67s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4284, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4551, device='cuda:0', grad_fn=) [2024-06-18 23:15:58,895] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3542.66 | bwd_microstep: 1885.76 | bwd_inner_microstep: 1880.73 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0097, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9679, device='cuda:0', grad_fn=) [2024-06-18 23:16:04,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:16:04,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3591.24 | bwd_microstep: 1998.78 | bwd_inner_microstep: 1993.29 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.32 [2024-06-18 23:16:04,658] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7133.88 | bwd: 3884.54 | bwd_inner: 3874.11 | bwd_allreduce: 10.20 | step: 61.39 56%|█████▌ | 334/600 [1:00:57<48:07, 10.86s/it] {'loss': 0.7115, 'learning_rate': 4.3273053962902076e-05, 'epoch': 3.34} 56%|█████▌ | 334/600 [1:00:57<48:07, 10.86s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6718, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6749, device='cuda:0', grad_fn=) [2024-06-18 23:16:10,252] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.17 | bwd_microstep: 1926.14 | bwd_inner_microstep: 1921.11 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7803, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7607, device='cuda:0', grad_fn=) [2024-06-18 23:16:15,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:16:15,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.03 | bwd_microstep: 1905.09 | bwd_inner_microstep: 1899.37 | bwd_allreduce_microstep: 5.61 | step_microstep: 61.79 [2024-06-18 23:16:15,880] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7124.19 | bwd: 3831.23 | bwd_inner: 3820.53 | 
bwd_allreduce: 10.45 | step: 61.87 56%|█████▌ | 335/600 [1:01:09<48:26, 10.97s/it] {'loss': 0.7178, 'learning_rate': 4.3005710802160965e-05, 'epoch': 3.35} 56%|█████▌ | 335/600 [1:01:09<48:26, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0024, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0833, device='cuda:0', grad_fn=) [2024-06-18 23:16:21,189] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3478.14 | bwd_microstep: 1738.18 | bwd_inner_microstep: 1733.26 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8641, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8477, device='cuda:0', grad_fn=) [2024-06-18 23:16:25,610] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 23:16:25,611] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2651.91 | bwd_microstep: 1616.93 | bwd_inner_microstep: 1611.43 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.05 [2024-06-18 23:16:25,612] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6130.03 | bwd: 3355.11 | bwd_inner: 3344.71 | bwd_allreduce: 10.21 | step: 61.13 56%|█████▌ | 336/600 [1:01:18<46:37, 10.60s/it] {'loss': 0.4655, 
'learning_rate': 4.27385714377255e-05, 'epoch': 3.36} 56%|█████▌ | 336/600 [1:01:18<46:37, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0064, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0869, device='cuda:0', grad_fn=) [2024-06-18 23:16:30,164] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2837.18 | bwd_microstep: 1626.39 | bwd_inner_microstep: 1621.36 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6900, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6798, device='cuda:0', grad_fn=) [2024-06-18 23:16:35,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:16:35,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.93 | bwd_microstep: 1921.15 | bwd_inner_microstep: 1915.54 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.48 [2024-06-18 23:16:35,807] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6395.09 | bwd: 3547.54 | bwd_inner: 3537.03 | bwd_allreduce: 10.24 | step: 61.56 56%|█████▌ | 337/600 [1:01:28<45:55, 10.48s/it] {'loss': 0.3833, 'learning_rate': 4.2471643653376685e-05, 'epoch': 3.37} 56%|█████▌ | 337/600 [1:01:28<45:55, 
10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8978, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8779, device='cuda:0', grad_fn=) [2024-06-18 23:16:41,474] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.93 | bwd_microstep: 1972.02 | bwd_inner_microstep: 1967.06 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9270, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8931, device='cuda:0', grad_fn=) [2024-06-18 23:16:46,422] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:16:46,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2947.83 | bwd_microstep: 1839.43 | bwd_inner_microstep: 1833.95 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.20 [2024-06-18 23:16:46,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6533.75 | bwd: 3811.44 | bwd_inner: 3801.02 | bwd_allreduce: 10.24 | step: 61.28 56%|█████▋ | 338/600 [1:01:39<45:55, 10.52s/it] {'loss': 0.8855, 'learning_rate': 4.220493522673067e-05, 'epoch': 3.38} 56%|█████▋ | 338/600 [1:01:39<45:55, 10.52s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0206, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0992, device='cuda:0', grad_fn=) [2024-06-18 23:16:50,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2696.79 | bwd_microstep: 1720.52 | bwd_inner_microstep: 1715.49 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4071, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4363, device='cuda:0', grad_fn=) [2024-06-18 23:16:56,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:16:56,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.38 | bwd_microstep: 1959.18 | bwd_inner_microstep: 1953.68 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.07 [2024-06-18 23:16:56,636] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6276.15 | bwd: 3679.69 | bwd_inner: 3669.22 | bwd_allreduce: 10.25 | step: 61.15 56%|█████▋ | 339/600 [1:01:49<45:21, 10.43s/it] {'loss': 0.2677, 'learning_rate': 4.193845392901201e-05, 'epoch': 3.39} 56%|█████▋ | 339/600 [1:01:49<45:21, 10.43s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 
6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8670, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8391, device='cuda:0', grad_fn=) [2024-06-18 23:17:01,887] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3250.97 | bwd_microstep: 1893.50 | bwd_inner_microstep: 1888.54 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7539, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7373, device='cuda:0', grad_fn=) [2024-06-18 23:17:07,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:17:07,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3591.49 | bwd_microstep: 1970.92 | bwd_inner_microstep: 1965.29 | bwd_allreduce_microstep: 5.51 | step_microstep: 61.34 [2024-06-18 23:17:07,618] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6842.43 | bwd: 3864.41 | bwd_inner: 3853.87 | bwd_allreduce: 10.34 | step: 61.42 57%|█████▋ | 340/600 [1:02:00<45:54, 10.59s/it] {'loss': 0.7882, 'learning_rate': 4.1672207524827275e-05, 'epoch': 3.4} 57%|█████▋ | 340/600 [1:02:00<45:54, 10.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6615, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6657, device='cuda:0', grad_fn=) [2024-06-18 23:17:13,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.00 | bwd_microstep: 1956.87 | bwd_inner_microstep: 1951.87 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5058, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5248, device='cuda:0', grad_fn=) [2024-06-18 23:17:18,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:17:18,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.50 | bwd_microstep: 1908.14 | bwd_inner_microstep: 1902.61 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.21 [2024-06-18 23:17:18,872] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7127.49 | bwd: 3865.00 | bwd_inner: 3854.50 | bwd_allreduce: 10.31 | step: 61.29 57%|█████▋ | 341/600 [1:02:12<46:34, 10.79s/it] {'loss': 0.5953, 'learning_rate': 4.140620377193885e-05, 'epoch': 3.41} 57%|█████▋ | 341/600 [1:02:12<46:34, 10.79s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 
6144]) tensor(0.7636, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7572, device='cuda:0', grad_fn=) [2024-06-18 23:17:23,281] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2670.27 | bwd_microstep: 1645.20 | bwd_inner_microstep: 1640.21 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8711, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8543, device='cuda:0', grad_fn=) [2024-06-18 23:17:28,870] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:17:28,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.67 | bwd_microstep: 1887.37 | bwd_inner_microstep: 1881.80 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.30 [2024-06-18 23:17:28,871] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6216.92 | bwd: 3532.56 | bwd_inner: 3522.07 | bwd_allreduce: 10.27 | step: 61.38 57%|█████▋ | 342/600 [1:02:22<45:22, 10.55s/it] {'loss': 0.8057, 'learning_rate': 4.114045042103887e-05, 'epoch': 3.42} 57%|█████▋ | 342/600 [1:02:22<45:22, 10.55s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0676, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) 
tensor(1.0307, device='cuda:0', grad_fn=) [2024-06-18 23:17:34,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.27 | bwd_microstep: 1890.94 | bwd_inner_microstep: 1885.92 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8511, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8360, device='cuda:0', grad_fn=) [2024-06-18 23:17:40,001] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:17:40,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.88 | bwd_microstep: 1885.77 | bwd_inner_microstep: 1880.30 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.18 [2024-06-18 23:17:40,002] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7099.13 | bwd: 3776.71 | bwd_inner: 3766.23 | bwd_allreduce: 10.28 | step: 61.28 57%|█████▋ | 343/600 [1:02:33<45:56, 10.73s/it] {'loss': 0.9334, 'learning_rate': 4.087495521552331e-05, 'epoch': 3.43} 57%|█████▋ | 343/600 [1:02:33<45:56, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4748, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.4976, device='cuda:0', grad_fn=) [2024-06-18 23:17:45,540] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | fwd_microstep: 3549.07 | bwd_microstep: 1889.79 | bwd_inner_microstep: 1884.72 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6412, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6467, device='cuda:0', grad_fn=) [2024-06-18 23:17:51,130] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:17:51,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.09 | bwd_microstep: 1887.91 | bwd_inner_microstep: 1882.36 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.94 [2024-06-18 23:17:51,131] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7097.14 | bwd: 3777.69 | bwd_inner: 3767.17 | bwd_allreduce: 10.25 | step: 61.02 57%|█████▋ | 344/600 [1:02:44<46:16, 10.85s/it] {'loss': 0.5721, 'learning_rate': 4.06097258912665e-05, 'epoch': 3.44} 57%|█████▋ | 344/600 [1:02:44<46:16, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6525, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6575, device='cuda:0', grad_fn=) [2024-06-18 23:17:56,711] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.56 | bwd_microstep: 1919.12 | bwd_inner_microstep: 1914.21 | 
bwd_allreduce_microstep: 4.79 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.8665, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8390, device='cuda:0', grad_fn=) [2024-06-18 23:18:01,761] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:18:01,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2979.99 | bwd_microstep: 1904.07 | bwd_inner_microstep: 1898.44 | bwd_allreduce_microstep: 5.51 | step_microstep: 61.59 [2024-06-18 23:18:01,762] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6538.53 | bwd: 3823.19 | bwd_inner: 3812.68 | bwd_allreduce: 10.31 | step: 61.67 57%|█████▊ | 345/600 [1:02:54<45:49, 10.78s/it] {'loss': 0.7482, 'learning_rate': 4.0344770176395606e-05, 'epoch': 3.45} 57%|█████▊ | 345/600 [1:02:54<45:49, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0843, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1573, device='cuda:0', grad_fn=) [2024-06-18 23:18:07,042] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3465.32 | bwd_microstep: 1725.09 | bwd_inner_microstep: 1720.00 | bwd_allreduce_microstep: 4.99 | step_microstep: 0.08 warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5799, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5918, device='cuda:0', grad_fn=) [2024-06-18 23:18:12,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:18:12,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.61 | bwd_microstep: 1914.87 | bwd_inner_microstep: 1909.27 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.31 [2024-06-18 23:18:12,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7017.89 | bwd: 3639.95 | bwd_inner: 3629.32 | bwd_allreduce: 10.39 | step: 61.39 58%|█████▊ | 346/600 [1:03:05<45:48, 10.82s/it] {'loss': 0.3746, 'learning_rate': 4.0080095791065505e-05, 'epoch': 3.46} 58%|█████▊ | 346/600 [1:03:05<45:48, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5417, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5574, device='cuda:0', grad_fn=) [2024-06-18 23:18:18,213] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.91 | bwd_microstep: 1891.65 | bwd_inner_microstep: 1886.46 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7149, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7022, device='cuda:0', grad_fn=) [2024-06-18 23:18:23,838] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:18:23,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.00 | bwd_microstep: 1906.97 | bwd_inner_microstep: 1901.41 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.19 [2024-06-18 23:18:23,839] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7106.89 | bwd: 3798.60 | bwd_inner: 3787.98 | bwd_allreduce: 10.37 | step: 61.33 58%|█████▊ | 347/600 [1:03:16<46:03, 10.92s/it] {'loss': 0.6298, 'learning_rate': 3.9815710447233836e-05, 'epoch': 3.47} 58%|█████▊ | 347/600 [1:03:16<46:03, 10.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.5896, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5894, device='cuda:0', grad_fn=) [2024-06-18 23:18:28,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2959.09 | bwd_microstep: 1865.39 | bwd_inner_microstep: 1860.32 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8236, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8000, device='cuda:0', grad_fn=) [2024-06-18 23:18:34,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:18:34,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.03 | bwd_microstep: 1924.59 | bwd_inner_microstep: 1918.92 | bwd_allreduce_microstep: 5.50 | step_microstep: 62.00 [2024-06-18 23:18:34,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6514.10 | bwd: 3789.97 | bwd_inner: 3779.32 | bwd_allreduce: 10.42 | step: 62.08 58%|█████▊ | 348/600 [1:03:27<45:26, 10.82s/it] {'loss': 0.6947, 'learning_rate': 3.955162184843625e-05, 'epoch': 3.48} 58%|█████▊ | 348/600 [1:03:27<45:26, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5996, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6095, device='cuda:0', grad_fn=) [2024-06-18 23:18:40,043] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.37 | bwd_microstep: 1951.51 | bwd_inner_microstep: 1946.51 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7850, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7760, device='cuda:0', grad_fn=) [2024-06-18 23:18:44,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:18:44,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2648.75 | bwd_microstep: 1613.43 | bwd_inner_microstep: 1607.99 | bwd_allreduce_microstep: 5.33 | step_microstep: 61.13 [2024-06-18 23:18:44,457] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6225.11 | bwd: 3564.93 | bwd_inner: 3554.55 | bwd_allreduce: 10.17 | step: 61.21 58%|█████▊ | 349/600 [1:03:37<44:17, 10.59s/it] {'loss': 0.6928, 'learning_rate': 3.9287837689562016e-05, 'epoch': 3.49} 58%|█████▊ | 349/600 [1:03:37<44:17, 10.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4460, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4709, device='cuda:0', grad_fn=) [2024-06-18 23:18:50,048] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.36 | bwd_microstep: 1923.07 | bwd_inner_microstep: 1917.87 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0054, device='cuda:0', grad_fn=) 
tensor(0.8190, device='cuda:0', grad_fn=) tensor(0.0868, device='cuda:0', grad_fn=) [2024-06-18 23:18:55,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:18:55,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3414.11 | bwd_microstep: 1639.12 | bwd_inner_microstep: 1633.73 | bwd_allreduce_microstep: 5.27 | step_microstep: 61.53 [2024-06-18 23:18:55,248] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6976.46 | bwd: 3562.18 | bwd_inner: 3551.64 | bwd_allreduce: 10.32 | step: 61.61 58%|█████▊ | 350/600 [1:03:48<44:21, 10.65s/it] {'loss': 0.2789, 'learning_rate': 3.902436565662977e-05, 'epoch': 3.5} 58%|█████▊ | 350/600 [1:03:48<44:21, 10.65s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0088, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0890, device='cuda:0', grad_fn=) [2024-06-18 23:18:59,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2700.85 | bwd_microstep: 1720.29 | bwd_inner_microstep: 1715.27 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5602, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5629, device='cuda:0', grad_fn=) [2024-06-18 
23:19:05,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:19:05,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.23 | bwd_microstep: 1935.64 | bwd_inner_microstep: 1930.13 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.33 [2024-06-18 23:19:05,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6264.03 | bwd: 3655.93 | bwd_inner: 3645.41 | bwd_allreduce: 10.32 | step: 61.41 58%|█████▊ | 351/600 [1:03:58<43:36, 10.51s/it] {'loss': 0.326, 'learning_rate': 3.876121342656355e-05, 'epoch': 3.51} 58%|█████▊ | 351/600 [1:03:58<43:36, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9717, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.9445, device='cuda:0', grad_fn=) [2024-06-18 23:19:11,067] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.43 | bwd_microstep: 1957.54 | bwd_inner_microstep: 1952.41 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4917, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5121, device='cuda:0', grad_fn=) [2024-06-18 23:19:16,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 
23:19:16,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.04 | bwd_microstep: 1893.10 | bwd_inner_microstep: 1887.55 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.11 [2024-06-18 23:19:16,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.45 | bwd: 3850.62 | bwd_inner: 3840.01 | bwd_allreduce: 10.37 | step: 61.19 59%|█████▊ | 352/600 [1:04:09<44:20, 10.73s/it] {'loss': 0.7283, 'learning_rate': 3.849838866696913e-05, 'epoch': 3.52} 59%|█████▊ | 352/600 [1:04:09<44:20, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0602, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1349, device='cuda:0', grad_fn=) [2024-06-18 23:19:21,971] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3472.85 | bwd_microstep: 1739.67 | bwd_inner_microstep: 1734.70 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6218, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.6187, device='cuda:0', grad_fn=) [2024-06-18 23:19:27,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:19:27,590] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.44 | 
bwd_microstep: 1905.45 | bwd_inner_microstep: 1899.90 | bwd_allreduce_microstep: 5.44 | step_microstep: 63.15 [2024-06-18 23:19:27,591] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7024.27 | bwd: 3645.12 | bwd_inner: 3634.62 | bwd_allreduce: 10.30 | step: 63.22 59%|█████▉ | 353/600 [1:04:20<44:24, 10.79s/it] {'loss': 0.3768, 'learning_rate': 3.823589903591063e-05, 'epoch': 3.53} 59%|█████▉ | 353/600 [1:04:20<44:24, 10.79s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6784, device='cuda:0', grad_fn=) tensor(0.7023, device='cuda:0', grad_fn=) tensor(0.6808, device='cuda:0', grad_fn=) [2024-06-18 23:19:33,180] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.06 | bwd_microstep: 1927.01 | bwd_inner_microstep: 1922.08 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7131, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7009, device='cuda:0', grad_fn=) [2024-06-18 23:19:38,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.98 [2024-06-18 23:19:38,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.64 | bwd_microstep: 1968.33 | bwd_inner_microstep: 1962.76 | bwd_allreduce_microstep: 5.35 | 
step_microstep: 65.73 [2024-06-18 23:19:38,900] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7138.68 | bwd: 3895.33 | bwd_inner: 3884.91 | bwd_allreduce: 10.17 | step: 65.81 59%|█████▉ | 354/600 [1:04:32<44:51, 10.94s/it] {'loss': 0.6908, 'learning_rate': 3.7973752181687335e-05, 'epoch': 3.54} 59%|█████▉ | 354/600 [1:04:32<44:51, 10.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6383, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6448, device='cuda:0', grad_fn=) [2024-06-18 23:19:44,450] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.54 | bwd_microstep: 1898.20 | bwd_inner_microstep: 1893.14 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6731, device='cuda:0', grad_fn=) tensor(0.5946, device='cuda:0', grad_fn=) tensor(0.6652, device='cuda:0', grad_fn=) [2024-06-18 23:19:50,106] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.93 [2024-06-18 23:19:50,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.33 | bwd_microstep: 1933.23 | bwd_inner_microstep: 1927.62 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.68 [2024-06-18 23:19:50,107] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 7110.83 | bwd: 3831.41 | bwd_inner: 3820.84 | bwd_allreduce: 10.26 | step: 61.76 59%|█████▉ | 355/600 [1:04:43<45:00, 11.02s/it] {'loss': 0.655, 'learning_rate': 3.771195574261084e-05, 'epoch': 3.55} 59%|█████▉ | 355/600 [1:04:43<45:00, 11.02s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5850, device='cuda:0', grad_fn=) tensor(0.6948, device='cuda:0', grad_fn=) tensor(0.5960, device='cuda:0', grad_fn=) [2024-06-18 23:19:55,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.07 | bwd_microstep: 1888.05 | bwd_inner_microstep: 1883.04 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.7380, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.7341, device='cuda:0', grad_fn=) [2024-06-18 23:20:00,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:20:00,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2755.49 | bwd_microstep: 1829.65 | bwd_inner_microstep: 1824.11 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.55 [2024-06-18 23:20:00,382] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6301.55 | bwd: 3717.69 | bwd_inner: 3707.19 | bwd_allreduce: 10.31 | step: 61.63 59%|█████▉ | 
356/600 [1:04:53<43:54, 10.80s/it] {'loss': 0.665, 'learning_rate': 3.745051734678256e-05, 'epoch': 3.56} 59%|█████▉ | 356/600 [1:04:53<43:54, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8383, device='cuda:0', grad_fn=) tensor(0.6948, device='cuda:0', grad_fn=) tensor(0.8239, device='cuda:0', grad_fn=) [2024-06-18 23:20:06,012] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.69 | bwd_microstep: 1952.86 | bwd_inner_microstep: 1947.72 | bwd_allreduce_microstep: 5.04 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5945, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6046, device='cuda:0', grad_fn=) [2024-06-18 23:20:11,634] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:20:11,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.88 | bwd_microstep: 1911.66 | bwd_inner_microstep: 1906.14 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.98 [2024-06-18 23:20:11,635] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.55 | bwd: 3864.52 | bwd_inner: 3853.92 | bwd_allreduce: 10.41 | step: 61.08 60%|█████▉ | 357/600 [1:05:04<44:17, 10.93s/it] {'loss': 0.7143, 'learning_rate': 3.718944461187138e-05, 'epoch': 
3.57} 60%|█████▉ | 357/600 [1:05:04<44:17, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0110, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0910, device='cuda:0', grad_fn=) [2024-06-18 23:20:17,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3492.00 | bwd_microstep: 1803.48 | bwd_inner_microstep: 1798.43 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.8354, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) [2024-06-18 23:20:21,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:20:21,792] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2763.21 | bwd_microstep: 1840.61 | bwd_inner_microstep: 1835.11 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.32 [2024-06-18 23:20:21,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6255.18 | bwd: 3644.09 | bwd_inner: 3633.60 | bwd_allreduce: 10.26 | step: 61.40 60%|█████▉ | 358/600 [1:05:14<43:09, 10.70s/it] {'loss': 0.4508, 'learning_rate': 3.692874514489173e-05, 'epoch': 3.58} 60%|█████▉ | 358/600 [1:05:14<43:09, 10.70s/it]warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7252, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7114, device='cuda:0', grad_fn=) [2024-06-18 23:20:27,394] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.72 | bwd_microstep: 1935.94 | bwd_inner_microstep: 1930.74 | bwd_allreduce_microstep: 5.08 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7339, device='cuda:0', grad_fn=) tensor(0.5946, device='cuda:0', grad_fn=) tensor(0.7200, device='cuda:0', grad_fn=) [2024-06-18 23:20:33,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 23:20:33,022] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.67 | bwd_microstep: 1909.14 | bwd_inner_microstep: 1903.50 | bwd_allreduce_microstep: 5.53 | step_microstep: 64.02 [2024-06-18 23:20:33,023] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7112.36 | bwd: 3845.07 | bwd_inner: 3834.26 | bwd_allreduce: 10.62 | step: 64.12 60%|█████▉ | 359/600 [1:05:26<43:37, 10.86s/it] {'loss': 0.7157, 'learning_rate': 3.666842654198191e-05, 'epoch': 3.59} 60%|█████▉ | 359/600 [1:05:26<43:37, 10.86s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9560, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9195, device='cuda:0', grad_fn=) [2024-06-18 23:20:37,723] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2759.75 | bwd_microstep: 1836.31 | bwd_inner_microstep: 1831.37 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0814, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1539, device='cuda:0', grad_fn=) [2024-06-18 23:20:43,009] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:20:43,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3447.23 | bwd_microstep: 1691.62 | bwd_inner_microstep: 1686.01 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.59 [2024-06-18 23:20:43,010] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6206.96 | bwd: 3527.93 | bwd_inner: 3517.44 | bwd_allreduce: 10.23 | step: 61.67 60%|██████ | 360/600 [1:05:36<42:23, 10.60s/it] {'loss': 0.5367, 'learning_rate': 3.640849638818286e-05, 'epoch': 3.6} 60%|██████ | 360/600 [1:05:36<42:23, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the 
size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0698, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1435, device='cuda:0', grad_fn=) [2024-06-18 23:20:47,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2836.11 | bwd_microstep: 1631.23 | bwd_inner_microstep: 1626.32 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8964, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8766, device='cuda:0', grad_fn=) [2024-06-18 23:20:53,159] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:20:53,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.70 | bwd_microstep: 1888.94 | bwd_inner_microstep: 1883.40 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.37 [2024-06-18 23:20:53,160] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6382.80 | bwd: 3520.17 | bwd_inner: 3509.73 | bwd_allreduce: 10.25 | step: 61.45 60%|██████ | 361/600 [1:05:46<41:40, 10.46s/it] {'loss': 0.5101, 'learning_rate': 3.614896225721699e-05, 'epoch': 3.61} 60%|██████ | 361/600 [1:05:46<41:40, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6318, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6386, device='cuda:0', grad_fn=) [2024-06-18 23:20:58,801] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.47 | bwd_microstep: 1959.44 | bwd_inner_microstep: 1954.38 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7728, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7546, device='cuda:0', grad_fn=) [2024-06-18 23:21:04,412] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:21:04,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.32 | bwd_microstep: 1902.73 | bwd_inner_microstep: 1897.19 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.24 [2024-06-18 23:21:04,413] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7128.77 | bwd: 3862.16 | bwd_inner: 3851.63 | bwd_allreduce: 10.29 | step: 61.32 60%|██████ | 362/600 [1:05:57<42:26, 10.70s/it] {'loss': 0.6966, 'learning_rate': 3.588983171126762e-05, 'epoch': 3.62} 60%|██████ | 362/600 [1:05:57<42:26, 10.70s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9562, device='cuda:0', grad_fn=) 
tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.9305, device='cuda:0', grad_fn=) [2024-06-18 23:21:09,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.15 | bwd_microstep: 1917.82 | bwd_inner_microstep: 1912.88 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0501, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0039, device='cuda:0', grad_fn=) [2024-06-18 23:21:14,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:21:14,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2953.62 | bwd_microstep: 1866.06 | bwd_inner_microstep: 1860.58 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.10 [2024-06-18 23:21:14,975] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6508.75 | bwd: 3783.88 | bwd_inner: 3773.48 | bwd_allreduce: 10.21 | step: 61.18 60%|██████ | 363/600 [1:06:08<42:06, 10.66s/it] {'loss': 0.9672, 'learning_rate': 3.5631112300758595e-05, 'epoch': 3.63} 60%|██████ | 363/600 [1:06:08<42:06, 10.66s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2411, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2866, device='cuda:0', grad_fn=) [2024-06-18 
23:21:20,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.38 | bwd_microstep: 1880.29 | bwd_inner_microstep: 1875.24 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0304, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1080, device='cuda:0', grad_fn=) [2024-06-18 23:21:25,862] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:21:25,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.45 | bwd_microstep: 1741.82 | bwd_inner_microstep: 1736.12 | bwd_allreduce_microstep: 5.58 | step_microstep: 61.79 [2024-06-18 23:21:25,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7020.81 | bwd: 3622.11 | bwd_inner: 3611.41 | bwd_allreduce: 10.44 | step: 61.87 61%|██████ | 364/600 [1:06:19<42:11, 10.73s/it] {'loss': 0.1973, 'learning_rate': 3.53728115641343e-05, 'epoch': 3.64} 61%|██████ | 364/600 [1:06:19<42:11, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6594, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6634, device='cuda:0', grad_fn=) [2024-06-18 23:21:31,454] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.61 | 
bwd_microstep: 1927.28 | bwd_inner_microstep: 1922.15 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(0.9058, device='cuda:0', grad_fn=) tensor(0.5946, device='cuda:0', grad_fn=) tensor(0.8746, device='cuda:0', grad_fn=) [2024-06-18 23:21:36,770] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:21:36,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3253.95 | bwd_microstep: 1898.37 | bwd_inner_microstep: 1892.88 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.00 [2024-06-18 23:21:36,771] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6813.52 | bwd: 3825.65 | bwd_inner: 3815.09 | bwd_allreduce: 10.33 | step: 61.08 61%|██████ | 365/600 [1:06:29<42:13, 10.78s/it] {'loss': 0.769, 'learning_rate': 3.5114937027639985e-05, 'epoch': 3.65} 61%|██████ | 365/600 [1:06:29<42:13, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7845, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.7760, device='cuda:0', grad_fn=) [2024-06-18 23:21:42,423] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.71 | bwd_microstep: 1964.11 | bwd_inner_microstep: 1958.92 | bwd_allreduce_microstep: 5.00 | 
step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.2000, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(1.1391, device='cuda:0', grad_fn=) [2024-06-18 23:21:47,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:21:47,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2839.56 | bwd_microstep: 2006.90 | bwd_inner_microstep: 2001.44 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.39 [2024-06-18 23:21:47,444] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6421.26 | bwd: 3971.00 | bwd_inner: 3960.42 | bwd_allreduce: 10.33 | step: 61.53 61%|██████ | 366/600 [1:06:40<41:55, 10.75s/it] {'loss': 0.9575, 'learning_rate': 3.4857496205102474e-05, 'epoch': 3.66} 61%|██████ | 366/600 [1:06:40<41:55, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0006, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0816, device='cuda:0', grad_fn=) [2024-06-18 23:21:51,199] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2384.59 | bwd_microstep: 1291.68 | bwd_inner_microstep: 1286.68 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0186, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0974, device='cuda:0', grad_fn=) [2024-06-18 23:21:56,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:21:56,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3464.24 | bwd_microstep: 1726.16 | bwd_inner_microstep: 1720.58 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.63 [2024-06-18 23:21:56,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5848.82 | bwd: 3017.84 | bwd_inner: 3007.35 | bwd_allreduce: 10.18 | step: 61.71 61%|██████ | 367/600 [1:06:49<39:48, 10.25s/it] {'loss': 0.0895, 'learning_rate': 3.460049659771124e-05, 'epoch': 3.67} 61%|██████ | 367/600 [1:06:49<39:48, 10.25s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0170, device='cuda:0', grad_fn=) tensor(0.8025, device='cuda:0', grad_fn=) tensor(0.0956, device='cuda:0', grad_fn=) [2024-06-18 23:22:01,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.37 | bwd_microstep: 1808.91 | bwd_inner_microstep: 1803.68 | bwd_allreduce_microstep: 5.13 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9857, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.9574, device='cuda:0', grad_fn=) [2024-06-18 23:22:07,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:22:07,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.49 | bwd_microstep: 1959.38 | bwd_inner_microstep: 1953.87 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.03 [2024-06-18 23:22:07,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7074.82 | bwd: 3768.28 | bwd_inner: 3757.56 | bwd_allreduce: 10.53 | step: 61.11 61%|██████▏ | 368/600 [1:07:00<40:37, 10.51s/it] {'loss': 0.5265, 'learning_rate': 3.434394569379988e-05, 'epoch': 3.68} 61%|██████▏ | 368/600 [1:07:00<40:37, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0007, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9598, device='cuda:0', grad_fn=) [2024-06-18 23:22:13,349] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3598.35 | bwd_microstep: 1998.15 | bwd_inner_microstep: 1993.01 | bwd_allreduce_microstep: 5.04 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of 
tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9488, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9131, device='cuda:0', grad_fn=) [2024-06-18 23:22:19,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 23:22:19,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3594.20 | bwd_microstep: 2005.22 | bwd_inner_microstep: 1999.62 | bwd_allreduce_microstep: 5.42 | step_microstep: 62.72 [2024-06-18 23:22:19,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7192.50 | bwd: 4003.37 | bwd_inner: 3992.68 | bwd_allreduce: 10.46 | step: 62.82 62%|██████▏ | 369/600 [1:07:12<41:34, 10.80s/it] {'loss': 0.9364, 'learning_rate': 3.408785096862782e-05, 'epoch': 3.69} 62%|██████▏ | 369/600 [1:07:12<41:34, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0717, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1452, device='cuda:0', grad_fn=) [2024-06-18 23:22:21,108] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1433.22 | bwd_microstep: 502.99 | bwd_inner_microstep: 497.97 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8260, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8130, device='cuda:0', grad_fn=) [2024-06-18 23:22:26,797] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:22:26,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3571.78 | bwd_microstep: 1955.79 | bwd_inner_microstep: 1950.32 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.37 [2024-06-18 23:22:26,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5004.98 | bwd: 2458.78 | bwd_inner: 2448.31 | bwd_allreduce: 10.27 | step: 61.45 62%|██████▏ | 370/600 [1:07:19<37:48, 9.86s/it] {'loss': 0.4791, 'learning_rate': 3.3832219884162585e-05, 'epoch': 3.7} 62%|██████▏ | 370/600 [1:07:19<37:48, 9.86s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0561, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1315, device='cuda:0', grad_fn=) [2024-06-18 23:22:31,417] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2860.64 | bwd_microstep: 1667.63 | bwd_inner_microstep: 1662.55 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9540, device='cuda:0', grad_fn=) tensor(0.5880, 
device='cuda:0', grad_fn=) tensor(0.9174, device='cuda:0', grad_fn=) [2024-06-18 23:22:36,698] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:22:36,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3245.36 | bwd_microstep: 1873.96 | bwd_inner_microstep: 1868.36 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.07 [2024-06-18 23:22:36,699] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6105.98 | bwd: 3541.59 | bwd_inner: 3531.00 | bwd_allreduce: 10.28 | step: 61.15 62%|██████▏ | 371/600 [1:07:29<37:41, 9.87s/it] {'loss': 0.5245, 'learning_rate': 3.3577059888862364e-05, 'epoch': 3.71} 62%|██████▏ | 371/600 [1:07:29<37:41, 9.87s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0481, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(1.0024, device='cuda:0', grad_fn=) [2024-06-18 23:22:42,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3594.38 | bwd_microstep: 2004.52 | bwd_inner_microstep: 1999.48 | bwd_allreduce_microstep: 4.93 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7131, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7006, device='cuda:0', grad_fn=) [2024-06-18 23:22:48,134] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:22:48,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3583.20 | bwd_microstep: 1973.27 | bwd_inner_microstep: 1967.82 | bwd_allreduce_microstep: 5.34 | step_microstep: 60.95 [2024-06-18 23:22:48,135] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7177.53 | bwd: 3977.79 | bwd_inner: 3967.31 | bwd_allreduce: 10.28 | step: 61.05 62%|██████▏ | 372/600 [1:07:41<39:18, 10.34s/it] {'loss': 0.8515, 'learning_rate': 3.332237841745898e-05, 'epoch': 3.72} 62%|██████▏ | 372/600 [1:07:41<39:18, 10.34s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7961, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7756, device='cuda:0', grad_fn=) [2024-06-18 23:22:53,829] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3595.79 | bwd_microstep: 1985.15 | bwd_inner_microstep: 1980.18 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9153, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8941, device='cuda:0', grad_fn=) [2024-06-18 23:22:59,430] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 
23:22:59,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.76 | bwd_microstep: 1894.84 | bwd_inner_microstep: 1889.09 | bwd_allreduce_microstep: 5.63 | step_microstep: 64.01 [2024-06-18 23:22:59,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7143.49 | bwd: 3879.99 | bwd_inner: 3869.29 | bwd_allreduce: 10.49 | step: 64.09 62%|██████▏ | 373/600 [1:07:52<40:12, 10.63s/it] {'loss': 0.8348, 'learning_rate': 3.30681828907412e-05, 'epoch': 3.73} 62%|██████▏ | 373/600 [1:07:52<40:12, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0231, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1019, device='cuda:0', grad_fn=) [2024-06-18 23:23:03,243] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2405.11 | bwd_microstep: 1326.69 | bwd_inner_microstep: 1321.72 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7096, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6971, device='cuda:0', grad_fn=) [2024-06-18 23:23:08,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:23:08,979] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3584.91 | 
bwd_microstep: 1982.23 | bwd_inner_microstep: 1976.68 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.66 [2024-06-18 23:23:08,980] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5989.99 | bwd: 3308.92 | bwd_inner: 3298.46 | bwd_allreduce: 10.24 | step: 61.74 62%|██████▏ | 374/600 [1:08:02<38:48, 10.30s/it] {'loss': 0.3995, 'learning_rate': 3.281448071533867e-05, 'epoch': 3.74} 62%|██████▏ | 374/600 [1:08:02<38:48, 10.30s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0666, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1410, device='cuda:0', grad_fn=) [2024-06-18 23:23:14,264] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3469.16 | bwd_microstep: 1725.90 | bwd_inner_microstep: 1720.81 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.4803, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5022, device='cuda:0', grad_fn=) [2024-06-18 23:23:18,906] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:23:18,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2844.80 | bwd_microstep: 1644.09 | bwd_inner_microstep: 1638.43 | bwd_allreduce_microstep: 5.56 | 
step_microstep: 62.25 [2024-06-18 23:23:18,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6313.94 | bwd: 3369.99 | bwd_inner: 3359.29 | bwd_allreduce: 10.46 | step: 62.33 62%|██████▎ | 375/600 [1:08:12<38:13, 10.19s/it] {'loss': 0.3216, 'learning_rate': 3.2561279283505883e-05, 'epoch': 3.75} 62%|██████▎ | 375/600 [1:08:12<38:13, 10.19s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8016, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7806, device='cuda:0', grad_fn=) [2024-06-18 23:23:24,508] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.69 | bwd_microstep: 1932.46 | bwd_inner_microstep: 1927.44 | bwd_allreduce_microstep: 4.91 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3270, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.3534, device='cuda:0', grad_fn=) [2024-06-18 23:23:30,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:23:30,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.85 | bwd_microstep: 1926.65 | bwd_inner_microstep: 1921.08 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.98 [2024-06-18 23:23:30,157] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 7119.53 | bwd: 3859.10 | bwd_inner: 3848.59 | bwd_allreduce: 10.27 | step: 61.06 63%|██████▎ | 376/600 [1:08:23<39:13, 10.51s/it] {'loss': 0.567, 'learning_rate': 3.2308585972906966e-05, 'epoch': 3.76} 63%|██████▎ | 376/600 [1:08:23<39:13, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9072, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8868, device='cuda:0', grad_fn=) [2024-06-18 23:23:35,736] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.12 | bwd_microstep: 1917.59 | bwd_inner_microstep: 1912.57 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) tensor(0.6312, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6376, device='cuda:0', grad_fn=) [2024-06-18 23:23:40,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:23:40,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3014.10 | bwd_microstep: 1697.40 | bwd_inner_microstep: 1691.74 | bwd_allreduce_microstep: 5.55 | step_microstep: 61.17 [2024-06-18 23:23:40,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6573.17 | bwd: 3614.99 | bwd_inner: 3604.37 | bwd_allreduce: 10.37 | step: 61.25 
63%|██████▎ | 377/600 [1:08:33<38:59, 10.49s/it] {'loss': 0.7622, 'learning_rate': 3.2056408146400614e-05, 'epoch': 3.77} 63%|██████▎ | 377/600 [1:08:33<38:59, 10.49s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.6844, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6859, device='cuda:0', grad_fn=) [2024-06-18 23:23:42,793] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1485.36 | bwd_microstep: 639.37 | bwd_inner_microstep: 634.43 | bwd_allreduce_microstep: 4.74 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9187, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8856, device='cuda:0', grad_fn=) [2024-06-18 23:23:48,433] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:23:48,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.27 | bwd_microstep: 1926.36 | bwd_inner_microstep: 1920.81 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.18 [2024-06-18 23:23:48,434] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5036.60 | bwd: 2565.72 | bwd_inner: 2555.30 | bwd_allreduce: 10.15 | step: 61.26 63%|██████▎ | 378/600 [1:08:41<35:51, 9.69s/it] {'loss': 0.7857, 'learning_rate': 
3.180475315182563e-05, 'epoch': 3.78} 63%|██████▎ | 378/600 [1:08:41<35:51, 9.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6601, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6640, device='cuda:0', grad_fn=) [2024-06-18 23:23:53,421] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3174.90 | bwd_microstep: 1714.09 | bwd_inner_microstep: 1709.14 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1807, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.2325, device='cuda:0', grad_fn=) [2024-06-18 23:23:59,049] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:23:59,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.55 | bwd_microstep: 1915.16 | bwd_inner_microstep: 1909.41 | bwd_allreduce_microstep: 5.58 | step_microstep: 61.61 [2024-06-18 23:23:59,050] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6728.44 | bwd: 3629.24 | bwd_inner: 3618.60 | bwd_allreduce: 10.42 | step: 61.69 63%|██████▎ | 379/600 [1:08:52<36:43, 9.97s/it] {'loss': 0.4483, 'learning_rate': 3.1553628321786745e-05, 'epoch': 3.79} 63%|██████▎ | 379/600 [1:08:52<36:43, 9.97s/it]warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6668, device='cuda:0', grad_fn=) tensor(0.6948, device='cuda:0', grad_fn=) tensor(0.6696, device='cuda:0', grad_fn=) [2024-06-18 23:24:04,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.24 | bwd_microstep: 1892.55 | bwd_inner_microstep: 1887.50 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0017, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0830, device='cuda:0', grad_fn=) [2024-06-18 23:24:10,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:24:10,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3501.72 | bwd_microstep: 1807.20 | bwd_inner_microstep: 1801.70 | bwd_allreduce_microstep: 5.32 | step_microstep: 61.21 [2024-06-18 23:24:10,058] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7053.95 | bwd: 3699.74 | bwd_inner: 3689.28 | bwd_allreduce: 10.20 | step: 61.29 63%|██████▎ | 380/600 [1:09:03<37:41, 10.28s/it] {'loss': 0.3763, 'learning_rate': 3.130304097344103e-05, 'epoch': 3.8} 63%|██████▎ | 380/600 [1:09:03<37:41, 10.28s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0132, device='cuda:0', grad_fn=) tensor(0.8025, device='cuda:0', grad_fn=) tensor(0.0922, device='cuda:0', grad_fn=) [2024-06-18 23:24:15,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3507.12 | bwd_microstep: 1740.88 | bwd_inner_microstep: 1735.72 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6426, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.6375, device='cuda:0', grad_fn=) [2024-06-18 23:24:21,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:24:21,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.99 | bwd_microstep: 1933.10 | bwd_inner_microstep: 1927.50 | bwd_allreduce_microstep: 5.48 | step_microstep: 61.88 [2024-06-18 23:24:21,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7086.09 | bwd: 3673.97 | bwd_inner: 3663.28 | bwd_allreduce: 10.45 | step: 61.96 64%|██████▎ | 381/600 [1:09:14<38:20, 10.50s/it] {'loss': 0.3648, 'learning_rate': 3.105299840828466e-05, 'epoch': 3.81} 64%|██████▎ | 381/600 [1:09:14<38:20, 10.50s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0815, device='cuda:0', grad_fn=) [2024-06-18 23:24:26,476] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.82 | bwd_microstep: 1807.82 | bwd_inner_microstep: 1802.78 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5356, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5408, device='cuda:0', grad_fn=) [2024-06-18 23:24:32,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:24:32,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.23 | bwd_microstep: 1924.19 | bwd_inner_microstep: 1918.56 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.43 [2024-06-18 23:24:32,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.02 | bwd: 3732.01 | bwd_inner: 3721.43 | bwd_allreduce: 10.28 | step: 61.50 64%|██████▎ | 382/600 [1:09:25<38:44, 10.66s/it] {'loss': 0.3112, 'learning_rate': 3.080350791194019e-05, 'epoch': 3.82} 64%|██████▎ | 382/600 [1:09:25<38:44, 10.66s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0012, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0821, device='cuda:0', grad_fn=) [2024-06-18 23:24:36,547] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2680.73 | bwd_microstep: 1654.10 | bwd_inner_microstep: 1649.07 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0006, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0816, device='cuda:0', grad_fn=) [2024-06-18 23:24:41,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:24:41,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3413.73 | bwd_microstep: 1639.54 | bwd_inner_microstep: 1633.92 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.34 [2024-06-18 23:24:41,748] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6094.44 | bwd: 3293.64 | bwd_inner: 3283.08 | bwd_allreduce: 10.29 | step: 61.42 64%|██████▍ | 383/600 [1:09:34<37:26, 10.35s/it] {'loss': 0.0819, 'learning_rate': 3.055457675394423e-05, 'epoch': 3.83} 64%|██████▍ | 383/600 [1:09:34<37:26, 10.35s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.0284, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1062, device='cuda:0', grad_fn=) [2024-06-18 23:24:47,057] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.61 | bwd_microstep: 1740.44 | bwd_inner_microstep: 1735.05 | bwd_allreduce_microstep: 5.27 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.4004, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4299, device='cuda:0', grad_fn=) [2024-06-18 23:24:51,704] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:24:51,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2843.77 | bwd_microstep: 1649.09 | bwd_inner_microstep: 1643.49 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.30 [2024-06-18 23:24:51,705] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6320.34 | bwd: 3389.52 | bwd_inner: 3378.59 | bwd_allreduce: 10.67 | step: 61.39 64%|██████▍ | 384/600 [1:09:44<36:50, 10.23s/it] {'loss': 0.268, 'learning_rate': 3.0306212187535653e-05, 'epoch': 3.84} 64%|██████▍ | 384/600 [1:09:44<36:50, 10.23s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9905, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.9614, 
device='cuda:0', grad_fn=) [2024-06-18 23:24:57,255] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.34 | bwd_microstep: 1898.97 | bwd_inner_microstep: 1893.87 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5971, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5962, device='cuda:0', grad_fn=) [2024-06-18 23:25:02,786] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:25:02,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3522.12 | bwd_microstep: 1848.07 | bwd_inner_microstep: 1842.54 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.01 [2024-06-18 23:25:02,787] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7073.44 | bwd: 3747.04 | bwd_inner: 3736.50 | bwd_allreduce: 10.31 | step: 61.09 64%|██████▍ | 385/600 [1:09:55<37:35, 10.49s/it] {'loss': 0.7788, 'learning_rate': 3.005842144944425e-05, 'epoch': 3.85} 64%|██████▍ | 385/600 [1:09:55<37:35, 10.49s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6975, device='cuda:0', grad_fn=) tensor(0.6948, device='cuda:0', grad_fn=) tensor(0.6972, device='cuda:0', grad_fn=) [2024-06-18 23:25:08,373] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 3560.68 | bwd_microstep: 1921.81 | bwd_inner_microstep: 1916.71 | bwd_allreduce_microstep: 4.97 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7256, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7119, device='cuda:0', grad_fn=) [2024-06-18 23:25:14,031] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:25:14,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.56 | bwd_microstep: 1930.32 | bwd_inner_microstep: 1924.71 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.36 [2024-06-18 23:25:14,032] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7124.23 | bwd: 3852.12 | bwd_inner: 3841.48 | bwd_allreduce: 10.37 | step: 61.45 64%|██████▍ | 386/600 [1:10:07<38:13, 10.72s/it] {'loss': 0.7045, 'learning_rate': 2.9811211759679924e-05, 'epoch': 3.86} 64%|██████▍ | 386/600 [1:10:07<38:13, 10.72s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0015, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0820, device='cuda:0', grad_fn=) [2024-06-18 23:25:18,592] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2838.96 | bwd_microstep: 1630.34 | bwd_inner_microstep: 1625.28 | 
bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8117, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8008, device='cuda:0', grad_fn=) [2024-06-18 23:25:24,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:25:24,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3731.86 | bwd_microstep: 1964.28 | bwd_inner_microstep: 1958.77 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.53 [2024-06-18 23:25:24,453] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6570.80 | bwd: 3594.61 | bwd_inner: 3584.10 | bwd_allreduce: 10.27 | step: 61.62 64%|██████▍ | 387/600 [1:10:17<37:43, 10.63s/it] {'loss': 0.4414, 'learning_rate': 2.9564590321322207e-05, 'epoch': 3.87} 64%|██████▍ | 387/600 [1:10:17<37:43, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3537, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.3879, device='cuda:0', grad_fn=) [2024-06-18 23:25:29,982] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.88 | bwd_microstep: 1884.57 | bwd_inner_microstep: 1879.62 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match 
the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9859, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.9457, device='cuda:0', grad_fn=) [2024-06-18 23:25:35,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:25:35,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.46 | bwd_microstep: 1932.81 | bwd_inner_microstep: 1927.14 | bwd_allreduce_microstep: 5.55 | step_microstep: 61.95 [2024-06-18 23:25:35,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7111.33 | bwd: 3817.37 | bwd_inner: 3806.78 | bwd_allreduce: 10.40 | step: 62.04 65%|██████▍ | 388/600 [1:10:28<38:08, 10.80s/it] {'loss': 0.6668, 'learning_rate': 2.9318564320310444e-05, 'epoch': 3.88} 65%|██████▍ | 388/600 [1:10:28<38:08, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0024, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0833, device='cuda:0', grad_fn=) [2024-06-18 23:25:40,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.97 | bwd_microstep: 1741.26 | bwd_inner_microstep: 1736.34 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8987, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8787, device='cuda:0', grad_fn=) [2024-06-18 23:25:46,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:25:46,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.74 | bwd_microstep: 1900.80 | bwd_inner_microstep: 1895.13 | bwd_allreduce_microstep: 5.50 | step_microstep: 61.69 [2024-06-18 23:25:46,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7028.68 | bwd: 3642.06 | bwd_inner: 3631.52 | bwd_allreduce: 10.31 | step: 61.77 65%|██████▍ | 389/600 [1:10:39<38:06, 10.83s/it] {'loss': 0.481, 'learning_rate': 2.907314092523442e-05, 'epoch': 3.89} 65%|██████▍ | 389/600 [1:10:39<38:06, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5935, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6044, device='cuda:0', grad_fn=) [2024-06-18 23:25:50,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2649.49 | bwd_microstep: 1613.53 | bwd_inner_microstep: 1608.61 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0043, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0849, device='cuda:0', grad_fn=) [2024-06-18 23:25:56,219] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:25:56,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3449.57 | bwd_microstep: 1692.30 | bwd_inner_microstep: 1686.68 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.82 [2024-06-18 23:25:56,220] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6099.05 | bwd: 3305.82 | bwd_inner: 3295.36 | bwd_allreduce: 10.23 | step: 61.90 65%|██████▌ | 390/600 [1:10:49<36:40, 10.48s/it] {'loss': 0.3447, 'learning_rate': 2.882832728712551e-05, 'epoch': 3.9} 65%|██████▌ | 390/600 [1:10:49<36:40, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0186, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0982, device='cuda:0', grad_fn=) [2024-06-18 23:26:01,537] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.37 | bwd_microstep: 1745.27 | bwd_inner_microstep: 1740.36 | bwd_allreduce_microstep: 4.80 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0063, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0871, device='cuda:0', grad_fn=) [2024-06-18 23:26:06,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:26:06,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.98 | bwd_microstep: 1738.86 | bwd_inner_microstep: 1733.28 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.59 [2024-06-18 23:26:06,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6952.33 | bwd: 3484.11 | bwd_inner: 3473.69 | bwd_allreduce: 10.18 | step: 61.67 65%|██████▌ | 391/600 [1:11:00<36:43, 10.54s/it] {'loss': 0.0927, 'learning_rate': 2.8584130539248166e-05, 'epoch': 3.91} 65%|██████▌ | 391/600 [1:11:00<36:43, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7692, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7623, device='cuda:0', grad_fn=) [2024-06-18 23:26:11,603] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2761.36 | bwd_microstep: 1833.56 | bwd_inner_microstep: 1828.64 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.8531, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8377, device='cuda:0', grad_fn=) [2024-06-18 23:26:17,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:26:17,238] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.64 | bwd_microstep: 1918.36 | bwd_inner_microstep: 1912.82 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.27 [2024-06-18 23:26:17,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6317.98 | bwd: 3751.91 | bwd_inner: 3741.48 | bwd_allreduce: 10.25 | step: 61.36 65%|██████▌ | 392/600 [1:11:10<36:19, 10.48s/it] {'loss': 0.8, 'learning_rate': 2.8340557796892354e-05, 'epoch': 3.92} 65%|██████▌ | 392/600 [1:11:10<36:19, 10.48s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5461, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5614, device='cuda:0', grad_fn=) [2024-06-18 23:26:21,684] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2682.18 | bwd_microstep: 1663.71 | bwd_inner_microstep: 1658.80 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0035, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0843, 
device='cuda:0', grad_fn=) [2024-06-18 23:26:27,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:26:27,155] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3506.06 | bwd_microstep: 1810.55 | bwd_inner_microstep: 1804.97 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.87 [2024-06-18 23:26:27,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6188.22 | bwd: 3474.24 | bwd_inner: 3463.79 | bwd_allreduce: 10.26 | step: 61.95 66%|██████▌ | 393/600 [1:11:20<35:34, 10.31s/it] {'loss': 0.3228, 'learning_rate': 2.8097616157165883e-05, 'epoch': 3.93} 66%|██████▌ | 393/600 [1:11:20<35:34, 10.31s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6965, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6972, device='cuda:0', grad_fn=) [2024-06-18 23:26:32,798] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.02 | bwd_microstep: 1961.32 | bwd_inner_microstep: 1956.39 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0286, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.9957, device='cuda:0', grad_fn=) [2024-06-18 23:26:38,500] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | optimizer_step: 1.89 [2024-06-18 23:26:38,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.88 | bwd_microstep: 1958.47 | bwd_inner_microstep: 1952.88 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.55 [2024-06-18 23:26:38,501] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7156.85 | bwd: 3919.79 | bwd_inner: 3909.30 | bwd_allreduce: 10.29 | step: 61.63 66%|██████▌ | 394/600 [1:11:31<36:27, 10.62s/it] {'loss': 0.8464, 'learning_rate': 2.7855312698787904e-05, 'epoch': 3.94} 66%|██████▌ | 394/600 [1:11:31<36:27, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0003, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0814, device='cuda:0', grad_fn=) [2024-06-18 23:26:43,225] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2888.74 | bwd_microstep: 1742.49 | bwd_inner_microstep: 1737.54 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0816, device='cuda:0', grad_fn=) [2024-06-18 23:26:48,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:26:48,601] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd_microstep: 3478.81 | bwd_microstep: 1744.44 | bwd_inner_microstep: 1738.86 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.69 [2024-06-18 23:26:48,601] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6367.53 | bwd: 3486.93 | bwd_inner: 3476.45 | bwd_allreduce: 10.28 | step: 61.77 66%|██████▌ | 395/600 [1:11:41<35:45, 10.46s/it] {'loss': 0.0815, 'learning_rate': 2.761365448188253e-05, 'epoch': 3.95} 66%|██████▌ | 395/600 [1:11:41<35:45, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7216, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7082, device='cuda:0', grad_fn=) [2024-06-18 23:26:54,165] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.02 | bwd_microstep: 1909.23 | bwd_inner_microstep: 1904.08 | bwd_allreduce_microstep: 5.02 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0034, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0838, device='cuda:0', grad_fn=) [2024-06-18 23:26:59,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:26:59,627] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3500.97 | bwd_microstep: 1807.95 | bwd_inner_microstep: 1802.46 | 
bwd_allreduce_microstep: 5.32 | step_microstep: 61.24 [2024-06-18 23:26:59,628] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7052.97 | bwd: 3717.18 | bwd_inner: 3706.60 | bwd_allreduce: 10.32 | step: 61.33 66%|██████▌ | 396/600 [1:11:52<36:09, 10.63s/it] {'loss': 0.396, 'learning_rate': 2.737264854777306e-05, 'epoch': 3.96} 66%|██████▌ | 396/600 [1:11:52<36:09, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5532, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5674, device='cuda:0', grad_fn=) [2024-06-18 23:27:05,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.67 | bwd_microstep: 1920.06 | bwd_inner_microstep: 1915.10 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.2665, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.1986, device='cuda:0', grad_fn=) [2024-06-18 23:27:09,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:27:09,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2756.94 | bwd_microstep: 1812.84 | bwd_inner_microstep: 1807.39 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.55 [2024-06-18 23:27:09,939] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6315.60 | bwd: 3732.89 | bwd_inner: 3722.50 | bwd_allreduce: 10.20 | step: 61.63 66%|██████▌ | 397/600 [1:12:03<35:38, 10.54s/it] {'loss': 0.883, 'learning_rate': 2.7132301918776977e-05, 'epoch': 3.97} 66%|██████▌ | 397/600 [1:12:03<35:38, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0416, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1185, device='cuda:0', grad_fn=) [2024-06-18 23:27:15,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.85 | bwd_microstep: 1808.92 | bwd_inner_microstep: 1803.58 | bwd_allreduce_microstep: 5.21 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3640, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.3979, device='cuda:0', grad_fn=) [2024-06-18 23:27:20,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:27:20,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.08 | bwd_microstep: 1890.42 | bwd_inner_microstep: 1884.91 | bwd_allreduce_microstep: 5.40 | step_microstep: 62.04 [2024-06-18 23:27:20,939] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7044.90 | bwd: 3699.34 | bwd_inner: 3688.51 | 
bwd_allreduce: 10.61 | step: 62.15 66%|██████▋ | 398/600 [1:12:14<35:56, 10.68s/it] {'loss': 0.2582, 'learning_rate': 2.6892621598001156e-05, 'epoch': 3.98} 66%|██████▋ | 398/600 [1:12:14<35:56, 10.68s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5655, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5786, device='cuda:0', grad_fn=) [2024-06-18 23:27:26,513] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.54 | bwd_microstep: 1915.01 | bwd_inner_microstep: 1909.96 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8973, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.8775, device='cuda:0', grad_fn=) [2024-06-18 23:27:32,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:27:32,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.41 | bwd_microstep: 1896.67 | bwd_inner_microstep: 1891.12 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.40 [2024-06-18 23:27:32,119] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7107.91 | bwd: 3811.67 | bwd_inner: 3801.17 | bwd_allreduce: 10.23 | step: 61.49 66%|██████▋ | 399/600 [1:12:25<36:16, 10.83s/it] {'loss': 
0.728, 'learning_rate': 2.6653614569137968e-05, 'epoch': 3.99} 66%|██████▋ | 399/600 [1:12:25<36:16, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7970, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7761, device='cuda:0', grad_fn=) [2024-06-18 23:27:36,779] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2753.51 | bwd_microstep: 1804.57 | bwd_inner_microstep: 1799.36 | bwd_allreduce_microstep: 5.10 | step_microstep: 0.14 please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6437, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6493, device='cuda:0', grad_fn=) [2024-06-18 23:27:43,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:27:43,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.11 | bwd_microstep: 1914.68 | bwd_inner_microstep: 1909.24 | bwd_allreduce_microstep: 5.33 | step_microstep: 61.78 [2024-06-18 23:27:43,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6304.56 | bwd: 3719.24 | bwd_inner: 3708.61 | bwd_allreduce: 10.44 | step: 61.93 67%|██████▋ | 400/600 [1:12:36<36:20, 10.90s/it] {'loss': 0.7127, 'learning_rate': 2.6415287796261706e-05, 'epoch': 4.0} 67%|██████▋ | 400/600 [1:12:36<36:20, 10.90s/it][2024-06-18 23:27:45,857] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:27:51,656] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:27:57,490] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) [2024-06-18 23:28:03,282] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0140, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0937, device='cuda:0', grad_fn=) [2024-06-18 23:28:12,052] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3482.96 | bwd_microstep: 1733.15 | bwd_inner_microstep: 1728.16 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5428, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5581, device='cuda:0', grad_fn=) [2024-06-18 23:28:16,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:28:16,857] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2773.30 | bwd_microstep: 1869.64 | bwd_inner_microstep: 1863.93 | bwd_allreduce_microstep: 5.60 | step_microstep: 61.49 [2024-06-18 23:28:16,858] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6256.19 | bwd: 3602.78 | bwd_inner: 3592.11 | bwd_allreduce: 10.48 | step: 61.57 67%|██████▋ | 401/600 [1:13:10<58:48, 17.73s/it] {'loss': 0.3259, 'learning_rate': 2.617764822362563e-05, 'epoch': 4.01} 67%|██████▋ | 401/600 [1:13:10<58:48, 17.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6416, device='cuda:0', grad_fn=) tensor(0.7031, 
device='cuda:0', grad_fn=) tensor(0.6477, device='cuda:0', grad_fn=) [2024-06-18 23:28:22,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.96 | bwd_microstep: 1957.01 | bwd_inner_microstep: 1952.03 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0043, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0853, device='cuda:0', grad_fn=) [2024-06-18 23:28:27,692] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:28:27,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3411.50 | bwd_microstep: 1640.05 | bwd_inner_microstep: 1634.59 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.29 [2024-06-18 23:28:27,693] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6985.43 | bwd: 3597.05 | bwd_inner: 3586.65 | bwd_allreduce: 10.19 | step: 61.37 67%|██████▋ | 402/600 [1:13:20<51:40, 15.66s/it] {'loss': 0.3665, 'learning_rate': 2.5940702775459747e-05, 'epoch': 4.02} 67%|██████▋ | 402/600 [1:13:20<51:40, 15.66s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0003, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0814, device='cuda:0', grad_fn=) [2024-06-18 23:28:32,978] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3474.06 | bwd_microstep: 1722.51 | bwd_inner_microstep: 1717.31 | bwd_allreduce_microstep: 5.04 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8320, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8184, device='cuda:0', grad_fn=) [2024-06-18 23:28:38,572] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 23:28:38,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.13 | bwd_microstep: 1890.54 | bwd_inner_microstep: 1885.01 | bwd_allreduce_microstep: 5.37 | step_microstep: 62.11 [2024-06-18 23:28:38,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7020.18 | bwd: 3613.05 | bwd_inner: 3602.42 | bwd_allreduce: 10.41 | step: 62.19 67%|██████▋ | 403/600 [1:13:31<46:42, 14.23s/it] {'loss': 0.4499, 'learning_rate': 2.5704458355768986e-05, 'epoch': 4.03} 67%|██████▋ | 403/600 [1:13:31<46:42, 14.23s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6145, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6119, device='cuda:0', grad_fn=) [2024-06-18 23:28:44,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.32 | bwd_microstep: 
1931.29 | bwd_inner_microstep: 1926.37 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.7054, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7044, device='cuda:0', grad_fn=) [2024-06-18 23:28:48,968] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:28:48,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2771.87 | bwd_microstep: 1865.30 | bwd_inner_microstep: 1859.86 | bwd_allreduce_microstep: 5.32 | step_microstep: 60.89 [2024-06-18 23:28:48,969] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6334.18 | bwd: 3796.59 | bwd_inner: 3786.24 | bwd_allreduce: 10.14 | step: 60.97 67%|██████▋ | 404/600 [1:13:42<42:43, 13.08s/it] {'loss': 0.6581, 'learning_rate': 2.5468921848131983e-05, 'epoch': 4.04} 67%|██████▋ | 404/600 [1:13:42<42:43, 13.08s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0083, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0889, device='cuda:0', grad_fn=) [2024-06-18 23:28:54,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3470.28 | bwd_microstep: 1737.89 | bwd_inner_microstep: 1732.79 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.07 
warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7832, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7634, device='cuda:0', grad_fn=) [2024-06-18 23:28:59,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:28:59,244] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2950.07 | bwd_microstep: 1857.91 | bwd_inner_microstep: 1852.32 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.83 [2024-06-18 23:28:59,245] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6420.33 | bwd: 3595.79 | bwd_inner: 3585.19 | bwd_allreduce: 10.34 | step: 61.91 68%|██████▊ | 405/600 [1:13:52<39:46, 12.24s/it] {'loss': 0.4261, 'learning_rate': 2.5234100115500643e-05, 'epoch': 4.05} 68%|██████▊ | 405/600 [1:13:52<39:46, 12.24s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8660, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8490, device='cuda:0', grad_fn=) [2024-06-18 23:29:04,836] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.36 | bwd_microstep: 1928.00 | bwd_inner_microstep: 1922.95 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7129, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7008, device='cuda:0', grad_fn=) [2024-06-18 23:29:10,461] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:29:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.29 | bwd_microstep: 1907.77 | bwd_inner_microstep: 1902.28 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.03 [2024-06-18 23:29:10,462] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7115.61 | bwd: 3835.76 | bwd_inner: 3825.31 | bwd_allreduce: 10.19 | step: 61.11 68%|██████▊ | 406/600 [1:14:03<38:34, 11.93s/it] {'loss': 0.7749, 'learning_rate': 2.500000000000001e-05, 'epoch': 4.06} 68%|██████▊ | 406/600 [1:14:03<38:34, 11.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7719, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7642, device='cuda:0', grad_fn=) [2024-06-18 23:29:16,038] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.81 | bwd_microstep: 1917.58 | bwd_inner_microstep: 1912.51 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5337, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5502, device='cuda:0', grad_fn=) [2024-06-18 23:29:21,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:29:21,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.76 | bwd_microstep: 1895.49 | bwd_inner_microstep: 1889.92 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.47 [2024-06-18 23:29:21,644] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7107.55 | bwd: 3813.06 | bwd_inner: 3802.48 | bwd_allreduce: 10.36 | step: 61.55 68%|██████▊ | 407/600 [1:14:14<37:39, 11.71s/it] {'loss': 0.6572, 'learning_rate': 2.4766628322729064e-05, 'epoch': 4.07} 68%|██████▊ | 407/600 [1:14:14<37:39, 11.71s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6151, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6235, device='cuda:0', grad_fn=) [2024-06-18 23:29:27,283] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.13 | bwd_microstep: 1957.02 | bwd_inner_microstep: 1951.84 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0008, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0814, device='cuda:0', grad_fn=) [2024-06-18 23:29:32,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.95 [2024-06-18 23:29:32,750] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3499.47 | bwd_microstep: 1813.52 | bwd_inner_microstep: 1807.89 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.94 [2024-06-18 23:29:32,751] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7078.56 | bwd: 3770.53 | bwd_inner: 3759.83 | bwd_allreduce: 10.44 | step: 62.02 68%|██████▊ | 408/600 [1:14:25<36:53, 11.53s/it] {'loss': 0.3524, 'learning_rate': 2.4533991883561868e-05, 'epoch': 4.08} 68%|██████▊ | 408/600 [1:14:25<36:53, 11.53s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.3941, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(1.3250, device='cuda:0', grad_fn=) [2024-06-18 23:29:37,168] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2675.06 | bwd_microstep: 1648.31 | bwd_inner_microstep: 1643.42 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 
6144]) tensor(0.6003, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.5994, device='cuda:0', grad_fn=) [2024-06-18 23:29:42,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:29:42,812] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.33 | bwd_microstep: 1924.66 | bwd_inner_microstep: 1919.02 | bwd_allreduce_microstep: 5.51 | step_microstep: 61.52 [2024-06-18 23:29:42,813] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6232.38 | bwd: 3572.96 | bwd_inner: 3562.46 | bwd_allreduce: 10.29 | step: 61.60 68%|██████▊ | 409/600 [1:14:35<35:17, 11.09s/it] {'loss': 0.9622, 'learning_rate': 2.430209746094943e-05, 'epoch': 4.09} 68%|██████▊ | 409/600 [1:14:35<35:17, 11.09s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4215, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4489, device='cuda:0', grad_fn=) [2024-06-18 23:29:47,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2945.50 | bwd_microstep: 1849.61 | bwd_inner_microstep: 1844.71 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7782, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) 
tensor(0.7591, device='cuda:0', grad_fn=) [2024-06-18 23:29:53,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:29:53,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.10 | bwd_microstep: 1940.53 | bwd_inner_microstep: 1934.99 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.45 [2024-06-18 23:29:53,381] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6510.57 | bwd: 3790.14 | bwd_inner: 3779.79 | bwd_allreduce: 10.15 | step: 61.53 68%|██████▊ | 410/600 [1:14:46<34:36, 10.93s/it] {'loss': 0.604, 'learning_rate': 2.407095181172227e-05, 'epoch': 4.1} 68%|██████▊ | 410/600 [1:14:46<34:36, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5680, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5815, device='cuda:0', grad_fn=) [2024-06-18 23:29:58,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2777.17 | bwd_microstep: 1866.26 | bwd_inner_microstep: 1861.33 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0014, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0828, device='cuda:0', grad_fn=) [2024-06-18 23:30:03,581] [INFO] [logging.py:96:log_dist] 
[Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:30:03,581] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.14 | bwd_microstep: 1806.88 | bwd_inner_microstep: 1801.33 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.63 [2024-06-18 23:30:03,582] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6274.30 | bwd: 3673.14 | bwd_inner: 3662.71 | bwd_allreduce: 10.19 | step: 61.71 68%|██████▊ | 411/600 [1:14:56<33:44, 10.71s/it] {'loss': 0.3321, 'learning_rate': 2.3840561670893496e-05, 'epoch': 4.11} 68%|██████▊ | 411/600 [1:14:56<33:44, 10.71s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5045, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5236, device='cuda:0', grad_fn=) [2024-06-18 23:30:09,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.51 | bwd_microstep: 1928.25 | bwd_inner_microstep: 1923.18 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7672, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7608, device='cuda:0', grad_fn=) [2024-06-18 23:30:14,824] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:30:14,825] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3566.59 | bwd_microstep: 1923.16 | bwd_inner_microstep: 1917.52 | bwd_allreduce_microstep: 5.52 | step_microstep: 62.01 [2024-06-18 23:30:14,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.08 | bwd: 3851.40 | bwd_inner: 3840.76 | bwd_allreduce: 10.40 | step: 62.09 69%|██████▊ | 412/600 [1:15:07<34:03, 10.87s/it] {'loss': 0.6422, 'learning_rate': 2.3610933751462553e-05, 'epoch': 4.12} 69%|██████▊ | 412/600 [1:15:07<34:03, 10.87s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0320, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1103, device='cuda:0', grad_fn=) [2024-06-18 23:30:19,169] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2645.77 | bwd_microstep: 1609.18 | bwd_inner_microstep: 1604.17 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.8724, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.8555, device='cuda:0', grad_fn=) [2024-06-18 23:30:23,883] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:30:23,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2748.52 | bwd_microstep: 1809.13 | 
bwd_inner_microstep: 1803.55 | bwd_allreduce_microstep: 5.39 | step_microstep: 61.04 [2024-06-18 23:30:23,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5394.27 | bwd: 3418.31 | bwd_inner: 3407.82 | bwd_allreduce: 10.21 | step: 61.13 69%|██████▉ | 413/600 [1:15:17<32:11, 10.33s/it] {'loss': 0.4829, 'learning_rate': 2.3382074744219668e-05, 'epoch': 4.13} 69%|██████▉ | 413/600 [1:15:17<32:11, 10.33s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5022, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5220, device='cuda:0', grad_fn=) [2024-06-18 23:30:29,427] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.45 | bwd_microstep: 1895.33 | bwd_inner_microstep: 1890.25 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1318, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.1882, device='cuda:0', grad_fn=) [2024-06-18 23:30:35,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:30:35,007] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.60 | bwd_microstep: 1881.37 | bwd_inner_microstep: 1875.73 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.68 [2024-06-18 
23:30:35,008] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7092.04 | bwd: 3776.70 | bwd_inner: 3766.09 | bwd_allreduce: 10.31 | step: 61.77 69%|██████▉ | 414/600 [1:15:28<32:45, 10.57s/it] {'loss': 0.3551, 'learning_rate': 2.315399131755081e-05, 'epoch': 4.14} 69%|██████▉ | 414/600 [1:15:28<32:45, 10.57s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5493, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5643, device='cuda:0', grad_fn=) [2024-06-18 23:30:40,208] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3240.40 | bwd_microstep: 1858.30 | bwd_inner_microstep: 1853.33 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6821, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6838, device='cuda:0', grad_fn=) [2024-06-18 23:30:45,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:30:45,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.70 | bwd_microstep: 1889.69 | bwd_inner_microstep: 1884.20 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.58 [2024-06-18 23:30:45,806] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6792.08 | bwd: 3747.98 | 
bwd_inner: 3737.55 | bwd_allreduce: 10.23 | step: 61.66 69%|██████▉ | 415/600 [1:15:38<32:47, 10.64s/it] {'loss': 0.6241, 'learning_rate': 2.292669011724351e-05, 'epoch': 4.15} 69%|██████▉ | 415/600 [1:15:38<32:47, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3231, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.3604, device='cuda:0', grad_fn=) [2024-06-18 23:30:50,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2747.51 | bwd_microstep: 1802.85 | bwd_inner_microstep: 1797.67 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8403, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8147, device='cuda:0', grad_fn=) [2024-06-18 23:30:56,126] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:30:56,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.29 | bwd_microstep: 1940.02 | bwd_inner_microstep: 1934.52 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.33 [2024-06-18 23:30:56,127] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6312.76 | bwd: 3742.86 | bwd_inner: 3732.24 | bwd_allreduce: 10.41 | step: 61.41 69%|██████▉ | 416/600 [1:15:49<32:19, 
10.54s/it] {'loss': 0.5875, 'learning_rate': 2.2700177766293096e-05, 'epoch': 4.16} 69%|██████▉ | 416/600 [1:15:49<32:19, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5239, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5418, device='cuda:0', grad_fn=) [2024-06-18 23:31:01,702] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.45 | bwd_microstep: 1917.32 | bwd_inner_microstep: 1912.26 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0184, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0977, device='cuda:0', grad_fn=) [2024-06-18 23:31:07,173] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:31:07,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3507.24 | bwd_microstep: 1810.96 | bwd_inner_microstep: 1805.32 | bwd_allreduce_microstep: 5.52 | step_microstep: 61.32 [2024-06-18 23:31:07,174] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7062.67 | bwd: 3728.27 | bwd_inner: 3717.62 | bwd_allreduce: 10.39 | step: 61.40 70%|██████▉ | 417/600 [1:16:00<32:36, 10.69s/it] {'loss': 0.3197, 'learning_rate': 2.2474460864709824e-05, 'epoch': 4.17} 70%|██████▉ | 
417/600 [1:16:00<32:36, 10.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0138, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0935, device='cuda:0', grad_fn=) [2024-06-18 23:31:12,456] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3465.48 | bwd_microstep: 1726.26 | bwd_inner_microstep: 1721.27 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6528, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6578, device='cuda:0', grad_fn=) [2024-06-18 23:31:16,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:31:16,961] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2686.47 | bwd_microstep: 1661.26 | bwd_inner_microstep: 1655.83 | bwd_allreduce_microstep: 5.31 | step_microstep: 61.59 [2024-06-18 23:31:16,962] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6151.93 | bwd: 3387.50 | bwd_inner: 3377.12 | bwd_allreduce: 10.19 | step: 61.68 70%|██████▉ | 418/600 [1:16:10<31:36, 10.42s/it] {'loss': 0.3756, 'learning_rate': 2.2249545989326514e-05, 'epoch': 4.18} 70%|██████▉ | 418/600 [1:16:10<31:36, 10.42s/it]warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6970, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6973, device='cuda:0', grad_fn=) [2024-06-18 23:31:22,536] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.42 | bwd_microstep: 1917.55 | bwd_inner_microstep: 1912.50 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.3338, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.3708, device='cuda:0', grad_fn=) [2024-06-18 23:31:26,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:31:26,963] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2654.01 | bwd_microstep: 1620.64 | bwd_inner_microstep: 1615.12 | bwd_allreduce_microstep: 5.40 | step_microstep: 61.27 [2024-06-18 23:31:26,964] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6208.41 | bwd: 3538.19 | bwd_inner: 3527.68 | bwd_allreduce: 10.26 | step: 61.35 70%|██████▉ | 419/600 [1:16:20<31:03, 10.30s/it] {'loss': 0.534, 'learning_rate': 2.2025439693606882e-05, 'epoch': 4.19} 70%|██████▉ | 419/600 [1:16:20<31:03, 10.30s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3368, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.3734, device='cuda:0', grad_fn=) [2024-06-18 23:31:31,489] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2700.87 | bwd_microstep: 1724.51 | bwd_inner_microstep: 1719.56 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8155, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7931, device='cuda:0', grad_fn=) [2024-06-18 23:31:37,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:31:37,151] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3564.16 | bwd_microstep: 1933.04 | bwd_inner_microstep: 1927.56 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.28 [2024-06-18 23:31:37,152] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6265.00 | bwd: 3657.54 | bwd_inner: 3647.13 | bwd_allreduce: 10.23 | step: 61.36 70%|███████ | 420/600 [1:16:30<30:47, 10.26s/it] {'loss': 0.5832, 'learning_rate': 2.180214850745467e-05, 'epoch': 4.2} 70%|███████ | 420/600 [1:16:30<30:47, 10.26s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of 
tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.4314, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4579, device='cuda:0', grad_fn=) [2024-06-18 23:31:41,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2681.58 | bwd_microstep: 1664.77 | bwd_inner_microstep: 1659.74 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6317, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.6389, device='cuda:0', grad_fn=) [2024-06-18 23:31:47,284] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:31:47,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.16 | bwd_microstep: 1952.99 | bwd_inner_microstep: 1947.40 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.17 [2024-06-18 23:31:47,285] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6255.73 | bwd: 3617.75 | bwd_inner: 3607.24 | bwd_allreduce: 10.22 | step: 61.26 70%|███████ | 421/600 [1:16:40<30:30, 10.22s/it] {'loss': 0.5484, 'learning_rate': 2.1579678937023363e-05, 'epoch': 4.21} 70%|███████ | 421/600 [1:16:40<30:30, 10.22s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7826, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7743, device='cuda:0', grad_fn=) [2024-06-18 23:31:52,933] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.65 | bwd_microstep: 1964.04 | bwd_inner_microstep: 1958.90 | bwd_allreduce_microstep: 4.96 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8517, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.8250, device='cuda:0', grad_fn=) [2024-06-18 23:31:57,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:31:57,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2758.36 | bwd_microstep: 1817.67 | bwd_inner_microstep: 1812.00 | bwd_allreduce_microstep: 5.55 | step_microstep: 61.97 [2024-06-18 23:31:57,673] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6337.00 | bwd: 3781.70 | bwd_inner: 3770.95 | bwd_allreduce: 10.51 | step: 62.06 70%|███████ | 422/600 [1:16:50<30:28, 10.27s/it] {'loss': 0.7997, 'learning_rate': 2.1358037464526515e-05, 'epoch': 4.22} 70%|███████ | 422/600 [1:16:50<30:28, 10.27s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6736, device='cuda:0', grad_fn=) tensor(0.7031, 
device='cuda:0', grad_fn=) tensor(0.6766, device='cuda:0', grad_fn=) [2024-06-18 23:32:03,227] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.78 | bwd_microstep: 1908.47 | bwd_inner_microstep: 1903.46 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7280, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7255, device='cuda:0', grad_fn=) [2024-06-18 23:32:08,852] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:32:08,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.18 | bwd_microstep: 1913.64 | bwd_inner_microstep: 1908.11 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.30 [2024-06-18 23:32:08,853] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7099.95 | bwd: 3822.11 | bwd_inner: 3811.64 | bwd_allreduce: 10.22 | step: 61.39 70%|███████ | 423/600 [1:17:02<31:06, 10.55s/it] {'loss': 0.7011, 'learning_rate': 2.1137230548049043e-05, 'epoch': 4.23} 70%|███████ | 423/600 [1:17:02<31:06, 10.55s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8476, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8328, device='cuda:0', grad_fn=) [2024-06-18 23:32:14,426] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.78 | bwd_microstep: 1917.19 | bwd_inner_microstep: 1912.14 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6536, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6582, device='cuda:0', grad_fn=) [2024-06-18 23:32:20,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:32:20,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.21 | bwd_microstep: 1893.54 | bwd_inner_microstep: 1888.00 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.39 [2024-06-18 23:32:20,028] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7101.97 | bwd: 3810.72 | bwd_inner: 3800.19 | bwd_allreduce: 10.28 | step: 61.47 71%|███████ | 424/600 [1:17:13<31:29, 10.73s/it] {'loss': 0.7455, 'learning_rate': 2.091726462135888e-05, 'epoch': 4.24} 71%|███████ | 424/600 [1:17:13<31:29, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.6850, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6865, device='cuda:0', grad_fn=) [2024-06-18 23:32:24,996] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2972.72 | bwd_microstep: 1892.68 
| bwd_inner_microstep: 1887.56 | bwd_allreduce_microstep: 5.01 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0333, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1110, device='cuda:0', grad_fn=) [2024-06-18 23:32:29,999] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:32:30,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3161.06 | bwd_microstep: 1693.33 | bwd_inner_microstep: 1687.77 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.16 [2024-06-18 23:32:30,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6133.77 | bwd: 3586.00 | bwd_inner: 3575.34 | bwd_allreduce: 10.45 | step: 61.25 71%|███████ | 425/600 [1:17:23<30:38, 10.51s/it] {'loss': 0.3987, 'learning_rate': 2.0698146093719656e-05, 'epoch': 4.25} 71%|███████ | 425/600 [1:17:23<30:38, 10.51s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4642, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.4769, device='cuda:0', grad_fn=) [2024-06-18 23:32:35,600] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.93 | bwd_microstep: 1930.80 | bwd_inner_microstep: 1925.76 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6158, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6131, device='cuda:0', grad_fn=) [2024-06-18 23:32:41,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:32:41,328] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3585.35 | bwd_microstep: 1975.33 | bwd_inner_microstep: 1969.73 | bwd_allreduce_microstep: 5.47 | step_microstep: 62.47 [2024-06-18 23:32:41,329] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7148.25 | bwd: 3906.12 | bwd_inner: 3895.56 | bwd_allreduce: 10.32 | step: 62.55 71%|███████ | 426/600 [1:17:34<31:10, 10.75s/it] {'loss': 0.545, 'learning_rate': 2.0479881349703883e-05, 'epoch': 4.26} 71%|███████ | 426/600 [1:17:34<31:10, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5788, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5909, device='cuda:0', grad_fn=) [2024-06-18 23:32:45,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2704.02 | bwd_microstep: 1730.11 | bwd_inner_microstep: 1724.90 | bwd_allreduce_microstep: 5.09 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.1134, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(1.0609, device='cuda:0', grad_fn=) [2024-06-18 23:32:51,657] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:32:51,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3600.17 | bwd_microstep: 2015.25 | bwd_inner_microstep: 2009.70 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.84 [2024-06-18 23:32:51,658] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6304.14 | bwd: 3745.35 | bwd_inner: 3734.65 | bwd_allreduce: 10.44 | step: 61.94 71%|███████ | 427/600 [1:17:44<30:38, 10.63s/it] {'loss': 0.8259, 'learning_rate': 2.0262476749006877e-05, 'epoch': 4.27} 71%|███████ | 427/600 [1:17:44<30:38, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1353, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.1917, device='cuda:0', grad_fn=) [2024-06-18 23:32:57,300] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.69 | bwd_microstep: 1959.27 | bwd_inner_microstep: 1954.24 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4811, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.4921, device='cuda:0', grad_fn=) [2024-06-18 23:33:02,924] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:33:02,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.04 | bwd_microstep: 1908.05 | bwd_inner_microstep: 1902.53 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.70 [2024-06-18 23:33:02,925] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7133.69 | bwd: 3867.31 | bwd_inner: 3856.82 | bwd_allreduce: 10.26 | step: 61.78 71%|███████▏ | 428/600 [1:17:56<31:00, 10.82s/it] {'loss': 0.3419, 'learning_rate': 2.0045938626261546e-05, 'epoch': 4.28} 71%|███████▏ | 428/600 [1:17:56<31:00, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1369, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.1935, device='cuda:0', grad_fn=) [2024-06-18 23:33:08,573] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.65 | bwd_microstep: 1962.79 | bwd_inner_microstep: 1957.58 | bwd_allreduce_microstep: 5.03 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6643, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.6682, device='cuda:0', grad_fn=) [2024-06-18 23:33:14,166] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:33:14,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.35 | bwd_microstep: 1888.15 | bwd_inner_microstep: 1882.54 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.20 [2024-06-18 23:33:14,167] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7126.96 | bwd: 3850.93 | bwd_inner: 3840.20 | bwd_allreduce: 10.43 | step: 61.34 72%|███████▏ | 429/600 [1:18:07<31:11, 10.95s/it] {'loss': 0.4308, 'learning_rate': 1.983027329085377e-05, 'epoch': 4.29} 72%|███████▏ | 429/600 [1:18:07<31:11, 10.95s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5881, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5992, device='cuda:0', grad_fn=) [2024-06-18 23:33:19,808] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.94 | bwd_microstep: 1958.94 | bwd_inner_microstep: 1953.95 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6352, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6419, device='cuda:0', grad_fn=) [2024-06-18 23:33:25,419] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:33:25,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.92 | bwd_microstep: 1901.27 | bwd_inner_microstep: 1895.60 | bwd_allreduce_microstep: 5.56 | step_microstep: 61.73 [2024-06-18 23:33:25,420] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7129.83 | bwd: 3860.22 | bwd_inner: 3849.57 | bwd_allreduce: 10.45 | step: 61.81 72%|███████▏ | 430/600 [1:18:18<31:16, 11.04s/it] {'loss': 0.6206, 'learning_rate': 1.9615487026738543e-05, 'epoch': 4.3} 72%|███████▏ | 430/600 [1:18:18<31:16, 11.04s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3197, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.3573, device='cuda:0', grad_fn=) [2024-06-18 23:33:31,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.37 | bwd_microstep: 1922.97 | bwd_inner_microstep: 1917.99 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4734, device='cuda:0', grad_fn=) 
tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4957, device='cuda:0', grad_fn=) [2024-06-18 23:33:36,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:33:36,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.91 | bwd_microstep: 1919.76 | bwd_inner_microstep: 1914.09 | bwd_allreduce_microstep: 5.56 | step_microstep: 63.27 [2024-06-18 23:33:36,643] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7114.24 | bwd: 3842.73 | bwd_inner: 3832.10 | bwd_allreduce: 10.43 | step: 63.36 72%|███████▏ | 431/600 [1:18:29<31:14, 11.09s/it] {'loss': 0.4265, 'learning_rate': 1.940158609225694e-05, 'epoch': 4.31} 72%|███████▏ | 431/600 [1:18:29<31:14, 11.09s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6275, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6347, device='cuda:0', grad_fn=) [2024-06-18 23:33:41,331] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2763.16 | bwd_microstep: 1824.51 | bwd_inner_microstep: 1819.34 | bwd_allreduce_microstep: 5.05 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6836, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6740, device='cuda:0', grad_fn=) [2024-06-18 
23:33:46,640] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:33:46,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3251.45 | bwd_microstep: 1892.60 | bwd_inner_microstep: 1887.12 | bwd_allreduce_microstep: 5.37 | step_microstep: 62.00 [2024-06-18 23:33:46,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6014.59 | bwd: 3717.10 | bwd_inner: 3706.47 | bwd_allreduce: 10.43 | step: 62.09 72%|███████▏ | 432/600 [1:18:39<30:08, 10.76s/it] {'loss': 0.6544, 'learning_rate': 1.9188576719953633e-05, 'epoch': 4.32} 72%|███████▏ | 432/600 [1:18:39<30:08, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0287, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1065, device='cuda:0', grad_fn=) [2024-06-18 23:33:52,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.52 | bwd_microstep: 1805.17 | bwd_inner_microstep: 1800.19 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0107, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.9687, device='cuda:0', grad_fn=) [2024-06-18 23:33:57,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 
[2024-06-18 23:33:57,957] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3641.23 | bwd_microstep: 2106.60 | bwd_inner_microstep: 2101.13 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.45 [2024-06-18 23:33:57,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.72 | bwd: 3911.77 | bwd_inner: 3901.34 | bwd_allreduce: 10.23 | step: 61.53 72%|███████▏ | 433/600 [1:18:51<30:25, 10.93s/it] {'loss': 0.5376, 'learning_rate': 1.8976465116395464e-05, 'epoch': 4.33} 72%|███████▏ | 433/600 [1:18:51<30:25, 10.93s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.3882, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4194, device='cuda:0', grad_fn=) [2024-06-18 23:34:02,632] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2751.63 | bwd_microstep: 1823.19 | bwd_inner_microstep: 1818.07 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0204, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0994, device='cuda:0', grad_fn=) [2024-06-18 23:34:08,088] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:34:08,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
3494.56 | bwd_microstep: 1808.70 | bwd_inner_microstep: 1803.15 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.54 [2024-06-18 23:34:08,089] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6246.15 | bwd: 3631.88 | bwd_inner: 3621.33 | bwd_allreduce: 10.29 | step: 61.62 72%|███████▏ | 434/600 [1:19:01<29:34, 10.69s/it] {'loss': 0.2594, 'learning_rate': 1.8765257461990442e-05, 'epoch': 4.34} 72%|███████▏ | 434/600 [1:19:01<29:34, 10.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4298, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4567, device='cuda:0', grad_fn=) [2024-06-18 23:34:13,625] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.36 | bwd_microstep: 1886.40 | bwd_inner_microstep: 1881.37 | bwd_allreduce_microstep: 4.85 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0533, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1287, device='cuda:0', grad_fn=) [2024-06-18 23:34:18,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:34:18,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3464.51 | bwd_microstep: 1728.70 | bwd_inner_microstep: 1723.12 | bwd_allreduce_microstep: 5.41 
| step_microstep: 61.35 [2024-06-18 23:34:18,967] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7010.84 | bwd: 3615.10 | bwd_inner: 3604.58 | bwd_allreduce: 10.21 | step: 61.43 72%|███████▎ | 435/600 [1:19:12<29:33, 10.75s/it] {'loss': 0.2927, 'learning_rate': 1.8554959910807775e-05, 'epoch': 4.35} 72%|███████▎ | 435/600 [1:19:12<29:33, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0081, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0887, device='cuda:0', grad_fn=) [2024-06-18 23:34:24,191] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3444.07 | bwd_microstep: 1691.36 | bwd_inner_microstep: 1686.37 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5776, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5894, device='cuda:0', grad_fn=) [2024-06-18 23:34:29,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:34:29,800] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.25 | bwd_microstep: 1894.43 | bwd_inner_microstep: 1888.68 | bwd_allreduce_microstep: 5.63 | step_microstep: 62.42 [2024-06-18 23:34:29,800] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd: 6999.31 | bwd: 3585.78 | bwd_inner: 3575.10 | bwd_allreduce: 10.43 | step: 62.50 73%|███████▎ | 436/600 [1:19:22<29:26, 10.77s/it] {'loss': 0.3391, 'learning_rate': 1.834557859039851e-05, 'epoch': 4.36} 73%|███████▎ | 436/600 [1:19:22<29:26, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8836, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8652, device='cuda:0', grad_fn=) [2024-06-18 23:34:35,452] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3582.14 | bwd_microstep: 1964.03 | bwd_inner_microstep: 1959.06 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3439, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3794, device='cuda:0', grad_fn=) [2024-06-18 23:34:41,171] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:34:41,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.97 | bwd_microstep: 1967.77 | bwd_inner_microstep: 1962.23 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.54 [2024-06-18 23:34:41,172] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7169.08 | bwd: 3931.80 | bwd_inner: 3921.32 | bwd_allreduce: 10.29 | step: 61.62 
73%|███████▎ | 437/600 [1:19:34<29:45, 10.95s/it] {'loss': 0.6223, 'learning_rate': 1.813711960161696e-05, 'epoch': 4.37} 73%|███████▎ | 437/600 [1:19:34<29:45, 10.95s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5324, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5487, device='cuda:0', grad_fn=) [2024-06-18 23:34:46,709] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.02 | bwd_microstep: 1890.68 | bwd_inner_microstep: 1885.74 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4778, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.4889, device='cuda:0', grad_fn=) [2024-06-18 23:34:52,368] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:34:52,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.71 | bwd_microstep: 1932.70 | bwd_inner_microstep: 1927.14 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.14 [2024-06-18 23:34:52,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7108.71 | bwd: 3823.38 | bwd_inner: 3812.89 | bwd_allreduce: 10.30 | step: 61.22 73%|███████▎ | 438/600 [1:19:45<29:46, 11.03s/it] {'loss': 0.5188, 'learning_rate': 
1.7929589018443016e-05, 'epoch': 4.38} 73%|███████▎ | 438/600 [1:19:45<29:46, 11.03s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0819, device='cuda:0', grad_fn=) [2024-06-18 23:34:57,766] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.33 | bwd_microstep: 1806.13 | bwd_inner_microstep: 1801.07 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5982, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6084, device='cuda:0', grad_fn=) [2024-06-18 23:35:03,357] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:35:03,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.98 | bwd_microstep: 1889.72 | bwd_inner_microstep: 1884.24 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.24 [2024-06-18 23:35:03,358] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7041.28 | bwd: 3695.84 | bwd_inner: 3685.36 | bwd_allreduce: 10.24 | step: 61.32 73%|███████▎ | 439/600 [1:19:56<29:33, 11.01s/it] {'loss': 0.3451, 'learning_rate': 1.772299288780508e-05, 'epoch': 4.39} 73%|███████▎ | 439/600 [1:19:56<29:33, 11.01s/it]warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6237, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6312, device='cuda:0', grad_fn=) [2024-06-18 23:35:09,000] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3576.80 | bwd_microstep: 1960.86 | bwd_inner_microstep: 1955.81 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8556, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8396, device='cuda:0', grad_fn=) [2024-06-18 23:35:14,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:35:14,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.51 | bwd_microstep: 1955.64 | bwd_inner_microstep: 1950.08 | bwd_allreduce_microstep: 5.38 | step_microstep: 61.19 [2024-06-18 23:35:14,696] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7155.28 | bwd: 3916.50 | bwd_inner: 3905.97 | bwd_allreduce: 10.23 | step: 61.27 73%|███████▎ | 440/600 [1:20:07<29:37, 11.11s/it] {'loss': 0.7354, 'learning_rate': 1.7517337229403946e-05, 'epoch': 4.4} 73%|███████▎ | 440/600 [1:20:07<29:37, 11.11s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0023, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0836, device='cuda:0', grad_fn=) [2024-06-18 23:35:19,998] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3471.34 | bwd_microstep: 1738.46 | bwd_inner_microstep: 1733.23 | bwd_allreduce_microstep: 5.10 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6834, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6850, device='cuda:0', grad_fn=) [2024-06-18 23:35:25,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:35:25,596] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.35 | bwd_microstep: 1890.92 | bwd_inner_microstep: 1885.39 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.88 [2024-06-18 23:35:25,597] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7019.65 | bwd: 3629.38 | bwd_inner: 3618.65 | bwd_allreduce: 10.53 | step: 61.98 74%|███████▎ | 441/600 [1:20:18<29:16, 11.05s/it] {'loss': 0.3843, 'learning_rate': 1.7312628035537387e-05, 'epoch': 4.41} 74%|███████▎ | 441/600 [1:20:18<29:16, 11.05s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6833, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6852, device='cuda:0', grad_fn=) [2024-06-18 23:35:31,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.07 | bwd_microstep: 1893.76 | bwd_inner_microstep: 1888.76 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5945, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5939, device='cuda:0', grad_fn=) [2024-06-18 23:35:34,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:35:34,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1803.83 | bwd_microstep: 1063.56 | bwd_inner_microstep: 1057.97 | bwd_allreduce_microstep: 5.49 | step_microstep: 62.74 [2024-06-18 23:35:34,144] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5350.87 | bwd: 2957.31 | bwd_inner: 2946.74 | bwd_allreduce: 10.39 | step: 62.83 74%|███████▎ | 442/600 [1:20:27<27:07, 10.30s/it] {'loss': 0.6396, 'learning_rate': 1.710887127092548e-05, 'epoch': 4.42} 74%|███████▎ | 442/600 [1:20:27<27:07, 10.30s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7416, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7378, device='cuda:0', grad_fn=) [2024-06-18 23:35:39,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3542.64 | bwd_microstep: 1886.62 | bwd_inner_microstep: 1881.46 | bwd_allreduce_microstep: 5.06 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6514, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6566, device='cuda:0', grad_fn=) [2024-06-18 23:35:45,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:35:45,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.73 | bwd_microstep: 1920.87 | bwd_inner_microstep: 1915.41 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.57 [2024-06-18 23:35:45,313] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7100.35 | bwd: 3807.49 | bwd_inner: 3796.89 | bwd_allreduce: 10.41 | step: 61.71 74%|███████▍ | 443/600 [1:20:38<27:37, 10.56s/it] {'loss': 0.6972, 'learning_rate': 1.6906072872536917e-05, 'epoch': 4.43} 74%|███████▍ | 443/600 [1:20:38<27:37, 10.56s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.0060, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0869, device='cuda:0', grad_fn=) [2024-06-18 23:35:49,708] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2668.78 | bwd_microstep: 1637.90 | bwd_inner_microstep: 1632.92 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7348, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7201, device='cuda:0', grad_fn=) [2024-06-18 23:35:55,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:35:55,347] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.06 | bwd_microstep: 1922.86 | bwd_inner_microstep: 1917.38 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.09 [2024-06-18 23:35:55,348] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6223.83 | bwd: 3560.76 | bwd_inner: 3550.31 | bwd_allreduce: 10.25 | step: 61.17 74%|███████▍ | 444/600 [1:20:48<27:02, 10.40s/it] {'loss': 0.4035, 'learning_rate': 1.6704238749415957e-05, 'epoch': 4.44} 74%|███████▍ | 444/600 [1:20:48<27:02, 10.40s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0015, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) 
tensor(0.0828, device='cuda:0', grad_fn=) [2024-06-18 23:36:00,571] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3440.92 | bwd_microstep: 1692.42 | bwd_inner_microstep: 1687.35 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2002, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.2501, device='cuda:0', grad_fn=) [2024-06-18 23:36:06,177] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:36:06,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.69 | bwd_microstep: 1899.54 | bwd_inner_microstep: 1894.09 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.20 [2024-06-18 23:36:06,178] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6990.57 | bwd: 3591.95 | bwd_inner: 3581.50 | bwd_allreduce: 10.22 | step: 61.28 74%|███████▍ | 445/600 [1:20:59<27:12, 10.53s/it] {'loss': 0.1665, 'learning_rate': 1.6503374782510234e-05, 'epoch': 4.45} 74%|███████▍ | 445/600 [1:20:59<27:12, 10.53s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6800, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6820, device='cuda:0', grad_fn=) [2024-06-18 23:36:11,827] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3583.63 | bwd_microstep: 1959.83 | bwd_inner_microstep: 1954.72 | bwd_allreduce_microstep: 5.00 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5361, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5524, device='cuda:0', grad_fn=) [2024-06-18 23:36:17,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.96 [2024-06-18 23:36:17,431] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.21 | bwd_microstep: 1896.63 | bwd_inner_microstep: 1891.09 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.99 [2024-06-18 23:36:17,432] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7133.83 | bwd: 3856.46 | bwd_inner: 3845.82 | bwd_allreduce: 10.45 | step: 62.09 74%|███████▍ | 446/600 [1:21:10<27:35, 10.75s/it] {'loss': 0.6172, 'learning_rate': 1.6303486824499458e-05, 'epoch': 4.46} 74%|███████▍ | 446/600 [1:21:10<27:35, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3518, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3866, device='cuda:0', grad_fn=) [2024-06-18 23:36:22,958] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3542.45 | bwd_microstep: 1884.83 | 
bwd_inner_microstep: 1879.88 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0163, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0957, device='cuda:0', grad_fn=) [2024-06-18 23:36:27,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:36:27,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2890.55 | bwd_microstep: 1746.18 | bwd_inner_microstep: 1740.58 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.34 [2024-06-18 23:36:27,747] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6432.95 | bwd: 3631.01 | bwd_inner: 3620.51 | bwd_allreduce: 10.25 | step: 61.42 74%|███████▍ | 447/600 [1:21:20<27:04, 10.62s/it] {'loss': 0.2412, 'learning_rate': 1.6104580699624837e-05, 'epoch': 4.47} 74%|███████▍ | 447/600 [1:21:20<27:04, 10.62s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4437, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4693, device='cuda:0', grad_fn=) [2024-06-18 23:36:33,330] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.63 | bwd_microstep: 1921.30 | bwd_inner_microstep: 1916.30 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The 
size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match 6.0398412668621895e-06, 'epoch': 5.08} 85%|████████▍ | 508/600 [1:32:35<16:49, 10.98s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5127, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5317, device='cuda:0', grad_fn=) [2024-06-18 23:47:48,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.45 | bwd_microstep: 1957.53 | bwd_inner_microstep: 1952.58 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0190, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0986, device='cuda:0', grad_fn=) [2024-06-18 23:47:52,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:47:52,937] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2866.24 | bwd_microstep: 1675.63 | bwd_inner_microstep: 1670.10 | bwd_allreduce_microstep: 5.36 | step_microstep: 60.70 [2024-06-18 23:47:52,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6441.65 | bwd: 3633.16 | 
bwd_inner: 3622.73 | bwd_allreduce: 10.18 | step: 60.79 85%|████████▍ | 509/600 [1:32:46<16:21, 10.78s/it] {'loss': 0.3151, 'learning_rate': 5.91189105589992e-06, 'epoch': 5.09} 85%|████████▍ | 509/600 [1:32:46<16:21, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0201, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0992, device='cuda:0', grad_fn=) [2024-06-18 23:47:58,222] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3464.38 | bwd_microstep: 1730.49 | bwd_inner_microstep: 1725.44 | bwd_allreduce_microstep: 4.92 | step_microstep: 0.14 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8276, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) [2024-06-18 23:48:03,033] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.02 [2024-06-18 23:48:03,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2781.57 | bwd_microstep: 1868.73 | bwd_inner_microstep: 1863.26 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.23 [2024-06-18 23:48:03,034] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6245.95 | bwd: 3599.21 | bwd_inner: 3588.73 | bwd_allreduce: 10.28 | step: 61.38 85%|████████▌ | 510/600 [1:32:56<15:51, 
10.58s/it] {'loss': 0.457, 'learning_rate': 5.785225463498828e-06, 'epoch': 5.1} 85%|████████▌ | 510/600 [1:32:56<15:51, 10.58s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0004, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0810, device='cuda:0', grad_fn=) [2024-06-18 23:48:08,428] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3497.56 | bwd_microstep: 1803.18 | bwd_inner_microstep: 1798.23 | bwd_allreduce_microstep: 4.84 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6451, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6391, device='cuda:0', grad_fn=) [2024-06-18 23:48:14,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:48:14,161] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3586.89 | bwd_microstep: 1979.41 | bwd_inner_microstep: 1974.01 | bwd_allreduce_microstep: 5.29 | step_microstep: 60.60 [2024-06-18 23:48:14,162] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7084.42 | bwd: 3782.58 | bwd_inner: 3772.25 | bwd_allreduce: 10.14 | step: 60.69 85%|████████▌ | 511/600 [1:33:07<15:56, 10.74s/it] {'loss': 0.3601, 'learning_rate': 5.659848180381283e-06, 'epoch': 5.11} 85%|████████▌ 
| 511/600 [1:33:07<15:56, 10.74s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0015, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0828, device='cuda:0', grad_fn=) [2024-06-18 23:48:16,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1699.75 | bwd_microstep: 827.56 | bwd_inner_microstep: 822.57 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3585, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.3811, device='cuda:0', grad_fn=) [2024-06-18 23:48:21,675] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:48:21,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2939.21 | bwd_microstep: 1828.10 | bwd_inner_microstep: 1822.78 | bwd_allreduce_microstep: 5.21 | step_microstep: 60.50 [2024-06-18 23:48:21,676] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4638.94 | bwd: 2655.66 | bwd_inner: 2645.41 | bwd_allreduce: 10.00 | step: 60.58 85%|████████▌ | 512/600 [1:33:14<14:20, 9.77s/it] {'loss': 0.232, 'learning_rate': 5.535762859731547e-06, 'epoch': 5.12} 85%|████████▌ | 512/600 [1:33:14<14:20, 9.77s/it]warning: The size of tensor a (0) must match the size of tensor b 
(1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7290, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7261, device='cuda:0', grad_fn=) [2024-06-18 23:48:27,212] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.64 | bwd_microstep: 1892.61 | bwd_inner_microstep: 1887.82 | bwd_allreduce_microstep: 4.69 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.3869, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.4073, device='cuda:0', grad_fn=) [2024-06-18 23:48:32,035] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:48:32,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2782.53 | bwd_microstep: 1876.51 | bwd_inner_microstep: 1870.86 | bwd_allreduce_microstep: 5.47 | step_microstep: 61.32 [2024-06-18 23:48:32,036] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6328.13 | bwd: 3769.11 | bwd_inner: 3758.73 | bwd_allreduce: 10.15 | step: 61.40 86%|████████▌ | 513/600 [1:33:25<14:25, 9.95s/it] {'loss': 0.5667, 'learning_rate': 5.412973117089287e-06, 'epoch': 5.13} 86%|████████▌ | 513/600 [1:33:25<14:25, 9.95s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5313, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5482, device='cuda:0', grad_fn=) [2024-06-18 23:48:37,605] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.34 | bwd_microstep: 1915.31 | bwd_inner_microstep: 1910.44 | bwd_allreduce_microstep: 4.77 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5543, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5689, device='cuda:0', grad_fn=) [2024-06-18 23:48:43,239] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:48:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.57 | bwd_microstep: 1919.15 | bwd_inner_microstep: 1913.79 | bwd_allreduce_microstep: 5.28 | step_microstep: 60.61 [2024-06-18 23:48:43,240] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7108.89 | bwd: 3834.45 | bwd_inner: 3824.23 | bwd_allreduce: 10.05 | step: 60.69 86%|████████▌ | 514/600 [1:33:36<14:48, 10.33s/it] {'loss': 0.5585, 'learning_rate': 5.291482530244179e-06, 'epoch': 5.14} 86%|████████▌ | 514/600 [1:33:36<14:48, 10.33s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8302, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8168, device='cuda:0', grad_fn=) [2024-06-18 23:48:48,876] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.55 | bwd_microstep: 1957.57 | bwd_inner_microstep: 1952.65 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1080, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.1675, device='cuda:0', grad_fn=) [2024-06-18 23:48:54,518] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:48:54,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.03 | bwd_microstep: 1924.15 | bwd_inner_microstep: 1918.71 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.02 [2024-06-18 23:48:54,519] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7131.54 | bwd: 3881.72 | bwd_inner: 3871.38 | bwd_allreduce: 10.15 | step: 61.16 86%|████████▌ | 515/600 [1:33:47<15:01, 10.61s/it] {'loss': 0.4921, 'learning_rate': 5.171294639131779e-06, 'epoch': 5.15} 86%|████████▌ | 515/600 [1:33:47<15:01, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5215, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.5284, device='cuda:0', grad_fn=) [2024-06-18 23:49:00,183] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3583.77 | bwd_microstep: 1973.29 | bwd_inner_microstep: 1968.39 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4089, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.4271, device='cuda:0', grad_fn=) [2024-06-18 23:49:05,844] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:49:05,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.49 | bwd_microstep: 1938.70 | bwd_inner_microstep: 1933.31 | bwd_allreduce_microstep: 5.26 | step_microstep: 60.89 [2024-06-18 23:49:05,845] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7142.22 | bwd: 3911.99 | bwd_inner: 3901.78 | bwd_allreduce: 10.02 | step: 60.98 86%|████████▌ | 516/600 [1:33:59<15:09, 10.83s/it] {'loss': 0.4778, 'learning_rate': 5.05241294573024e-06, 'epoch': 5.16} 86%|████████▌ | 516/600 [1:33:59<15:09, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.5861, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5975, device='cuda:0', grad_fn=) [2024-06-18 23:49:11,494] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.63 | bwd_microstep: 1964.71 | bwd_inner_microstep: 1959.91 | bwd_allreduce_microstep: 4.69 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4293, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.4567, device='cuda:0', grad_fn=) [2024-06-18 23:49:17,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:49:17,142] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.52 | bwd_microstep: 1930.55 | bwd_inner_microstep: 1924.94 | bwd_allreduce_microstep: 5.49 | step_microstep: 63.08 [2024-06-18 23:49:17,143] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.12 | bwd: 3895.25 | bwd_inner: 3884.87 | bwd_allreduce: 10.19 | step: 63.16 86%|████████▌ | 517/600 [1:34:10<15:10, 10.97s/it] {'loss': 0.5271, 'learning_rate': 4.934840913958388e-06, 'epoch': 5.17} 86%|████████▌ | 517/600 [1:34:10<15:10, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3530, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) 
tensor(0.3872, device='cuda:0', grad_fn=) [2024-06-18 23:49:22,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3543.75 | bwd_microstep: 1891.93 | bwd_inner_microstep: 1887.11 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0985, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.1582, device='cuda:0', grad_fn=) [2024-06-18 23:49:28,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:49:28,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3539.55 | bwd_microstep: 1880.27 | bwd_inner_microstep: 1874.83 | bwd_allreduce_microstep: 5.27 | step_microstep: 60.85 [2024-06-18 23:49:28,250] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7083.28 | bwd: 3772.19 | bwd_inner: 3761.98 | bwd_allreduce: 9.97 | step: 60.93 86%|████████▋ | 518/600 [1:34:21<15:02, 11.01s/it] {'loss': 0.2727, 'learning_rate': 4.818581969574742e-06, 'epoch': 5.18} 86%|████████▋ | 518/600 [1:34:21<15:02, 11.01s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7370, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7329, device='cuda:0', grad_fn=) [2024-06-18 23:49:33,835] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.22 | bwd_microstep: 1924.98 | bwd_inner_microstep: 1920.03 | bwd_allreduce_microstep: 4.80 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5436, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5477, device='cuda:0', grad_fn=) [2024-06-18 23:49:39,495] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:49:39,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.54 | bwd_microstep: 1933.66 | bwd_inner_microstep: 1928.11 | bwd_allreduce_microstep: 5.43 | step_microstep: 61.52 [2024-06-18 23:49:39,496] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7118.74 | bwd: 3858.63 | bwd_inner: 3848.19 | bwd_allreduce: 10.24 | step: 61.60 86%|████████▋ | 519/600 [1:34:32<14:57, 11.08s/it] {'loss': 0.6403, 'learning_rate': 4.703639500077656e-06, 'epoch': 5.19} 86%|████████▋ | 519/600 [1:34:32<14:57, 11.08s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.2842, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.3256, device='cuda:0', grad_fn=) [2024-06-18 23:49:44,150] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2743.63 | bwd_microstep: 1809.86 | 
bwd_inner_microstep: 1805.01 | bwd_allreduce_microstep: 4.68 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6584, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6622, device='cuda:0', grad_fn=) [2024-06-18 23:49:49,752] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:49:49,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.72 | bwd_microstep: 1895.61 | bwd_inner_microstep: 1890.10 | bwd_allreduce_microstep: 5.33 | step_microstep: 60.59 [2024-06-18 23:49:49,753] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6292.33 | bwd: 3705.47 | bwd_inner: 3695.20 | bwd_allreduce: 10.00 | step: 60.67 87%|████████▋ | 520/600 [1:34:42<14:26, 10.83s/it] {'loss': 0.4939, 'learning_rate': 4.590016854606727e-06, 'epoch': 5.2} 87%|████████▋ | 520/600 [1:34:42<14:26, 10.83s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6281, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.6244, device='cuda:0', grad_fn=) [2024-06-18 23:49:55,309] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.80 | bwd_microstep: 1904.17 | bwd_inner_microstep: 1899.33 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4789, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.4898, device='cuda:0', grad_fn=) [2024-06-18 23:50:01,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:50:01,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.73 | bwd_microstep: 1970.62 | bwd_inner_microstep: 1965.18 | bwd_allreduce_microstep: 5.26 | step_microstep: 60.66 [2024-06-18 23:50:01,027] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7129.52 | bwd: 3874.78 | bwd_inner: 3864.57 | bwd_allreduce: 9.97 | step: 60.74 87%|████████▋ | 521/600 [1:34:54<14:26, 10.97s/it] {'loss': 0.5571, 'learning_rate': 4.477717343845078e-06, 'epoch': 5.21} 87%|████████▋ | 521/600 [1:34:54<14:26, 10.97s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3730, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4053, device='cuda:0', grad_fn=) [2024-06-18 23:50:06,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.99 | bwd_microstep: 1928.52 | bwd_inner_microstep: 1923.62 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1220, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.1793, device='cuda:0', grad_fn=) [2024-06-18 23:50:12,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:50:12,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3565.91 | bwd_microstep: 1924.94 | bwd_inner_microstep: 1919.41 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.02 [2024-06-18 23:50:12,270] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.87 | bwd: 3853.45 | bwd_inner: 3843.07 | bwd_allreduce: 10.13 | step: 61.10 87%|████████▋ | 522/600 [1:35:05<14:21, 11.05s/it] {'loss': 0.2923, 'learning_rate': 4.366744239922998e-06, 'epoch': 5.22} 87%|████████▋ | 522/600 [1:35:05<14:21, 11.05s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5597, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5622, device='cuda:0', grad_fn=) [2024-06-18 23:50:16,938] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2754.30 | bwd_microstep: 1809.57 | bwd_inner_microstep: 1804.74 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0319, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1098, device='cuda:0', grad_fn=) [2024-06-18 23:50:21,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:50:21,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2647.80 | bwd_microstep: 1607.26 | bwd_inner_microstep: 1601.85 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.87 [2024-06-18 23:50:21,342] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5402.06 | bwd: 3416.82 | bwd_inner: 3406.61 | bwd_allreduce: 10.03 | step: 60.95 87%|████████▋ | 523/600 [1:35:14<13:25, 10.46s/it] {'loss': 0.336, 'learning_rate': 4.257100776322525e-06, 'epoch': 5.23} 87%|████████▋ | 523/600 [1:35:14<13:25, 10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3022, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3419, device='cuda:0', grad_fn=) [2024-06-18 23:50:26,893] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.52 | bwd_microstep: 1904.75 | bwd_inner_microstep: 1899.82 | bwd_allreduce_microstep: 4.77 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4614, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4852, device='cuda:0', grad_fn=) [2024-06-18 23:50:32,490] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.88 [2024-06-18 23:50:32,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.18 | bwd_microstep: 1892.20 | bwd_inner_microstep: 1886.76 | bwd_allreduce_microstep: 5.31 | step_microstep: 60.88 [2024-06-18 23:50:32,491] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7097.68 | bwd: 3796.94 | bwd_inner: 3786.65 | bwd_allreduce: 10.05 | step: 60.96 87%|████████▋ | 524/600 [1:35:25<13:30, 10.66s/it] {'loss': 0.4136, 'learning_rate': 4.148790147783288e-06, 'epoch': 5.24} 87%|████████▋ | 524/600 [1:35:25<13:30, 10.66s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4929, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5136, device='cuda:0', grad_fn=) [2024-06-18 23:50:38,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.75 | bwd_microstep: 1890.31 | bwd_inner_microstep: 1885.46 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7779, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7592, device='cuda:0', grad_fn=) [2024-06-18 23:50:43,764] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:50:43,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3589.72 | bwd_microstep: 1982.23 | bwd_inner_microstep: 1976.67 | bwd_allreduce_microstep: 5.39 | step_microstep: 60.74 [2024-06-18 23:50:43,765] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.44 | bwd: 3872.54 | bwd_inner: 3862.18 | bwd_allreduce: 10.12 | step: 60.82 88%|████████▊ | 525/600 [1:35:36<13:33, 10.85s/it] {'loss': 0.6364, 'learning_rate': 4.041815510209396e-06, 'epoch': 5.25} 88%|████████▊ | 525/600 [1:35:36<13:33, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3898, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4208, device='cuda:0', grad_fn=) [2024-06-18 23:50:49,302] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.54 | bwd_microstep: 1892.95 | bwd_inner_microstep: 1888.09 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7322, device='cuda:0', grad_fn=) 
tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7181, device='cuda:0', grad_fn=) [2024-06-18 23:50:54,615] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:50:54,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3252.13 | bwd_microstep: 1895.72 | bwd_inner_microstep: 1890.30 | bwd_allreduce_microstep: 5.25 | step_microstep: 60.71 [2024-06-18 23:50:54,616] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6797.65 | bwd: 3788.67 | bwd_inner: 3778.44 | bwd_allreduce: 9.98 | step: 60.79 88%|████████▊ | 526/600 [1:35:47<13:22, 10.85s/it] {'loss': 0.5695, 'learning_rate': 3.936179980577453e-06, 'epoch': 5.26} 88%|████████▊ | 526/600 [1:35:47<13:22, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4727, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4950, device='cuda:0', grad_fn=) [2024-06-18 23:51:00,176] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.76 | bwd_microstep: 1909.69 | bwd_inner_microstep: 1904.75 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0025, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0833, device='cuda:0', grad_fn=) [2024-06-18 
23:51:04,641] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:51:04,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2676.72 | bwd_microstep: 1642.69 | bwd_inner_microstep: 1637.08 | bwd_allreduce_microstep: 5.44 | step_microstep: 61.06 [2024-06-18 23:51:04,642] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6228.45 | bwd: 3552.38 | bwd_inner: 3541.93 | bwd_allreduce: 10.17 | step: 61.14 88%|████████▊ | 527/600 [1:35:57<12:53, 10.60s/it] {'loss': 0.2892, 'learning_rate': 3.8318866368458e-06, 'epoch': 5.27} 88%|████████▊ | 527/600 [1:35:57<12:53, 10.60s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4742, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4967, device='cuda:0', grad_fn=) [2024-06-18 23:51:10,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.90 | bwd_microstep: 1914.11 | bwd_inner_microstep: 1909.19 | bwd_allreduce_microstep: 4.74 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(9.4694e-05, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0807, device='cuda:0', grad_fn=) [2024-06-18 23:51:15,665] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 
[2024-06-18 23:51:15,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.82 | bwd_microstep: 1813.49 | bwd_inner_microstep: 1808.00 | bwd_allreduce_microstep: 5.31 | step_microstep: 61.02 [2024-06-18 23:51:15,667] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7045.67 | bwd: 3727.59 | bwd_inner: 3717.28 | bwd_allreduce: 10.01 | step: 61.09 88%|████████▊ | 528/600 [1:36:08<12:52, 10.73s/it] {'loss': 0.2887, 'learning_rate': 3.728938517864794e-06, 'epoch': 5.28} 88%|████████▊ | 528/600 [1:36:08<12:52, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4042, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4333, device='cuda:0', grad_fn=) [2024-06-18 23:51:21,200] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3545.72 | bwd_microstep: 1889.52 | bwd_inner_microstep: 1884.54 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0004, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0810, device='cuda:0', grad_fn=) [2024-06-18 23:51:26,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:51:26,651] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
3494.01 | bwd_microstep: 1806.56 | bwd_inner_microstep: 1801.00 | bwd_allreduce_microstep: 5.38 | step_microstep: 60.93 [2024-06-18 23:51:26,652] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7039.69 | bwd: 3696.08 | bwd_inner: 3685.62 | bwd_allreduce: 10.15 | step: 61.01 88%|████████▊ | 529/600 [1:36:19<12:47, 10.81s/it] {'loss': 0.2571, 'learning_rate': 3.6273386232882343e-06, 'epoch': 5.29} 88%|████████▊ | 529/600 [1:36:19<12:47, 10.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5148, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5336, device='cuda:0', grad_fn=) [2024-06-18 23:51:32,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.07 | bwd_microstep: 1887.93 | bwd_inner_microstep: 1882.95 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5562, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5590, device='cuda:0', grad_fn=) [2024-06-18 23:51:37,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:51:37,804] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.27 | bwd_microstep: 1910.18 | bwd_inner_microstep: 1904.87 | bwd_allreduce_microstep: 
5.20 | step_microstep: 60.66 [2024-06-18 23:51:37,805] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7096.32 | bwd: 3798.10 | bwd_inner: 3787.87 | bwd_allreduce: 10.00 | step: 60.75 88%|████████▊ | 530/600 [1:36:30<12:43, 10.91s/it] {'loss': 0.5463, 'learning_rate': 3.527089913486037e-06, 'epoch': 5.3} 88%|████████▊ | 530/600 [1:36:30<12:43, 10.91s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0183, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0975, device='cuda:0', grad_fn=) [2024-06-18 23:51:42,364] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2838.96 | bwd_microstep: 1630.31 | bwd_inner_microstep: 1625.48 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8199, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.8079, device='cuda:0', grad_fn=) [2024-06-18 23:51:47,965] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:51:47,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.62 | bwd_microstep: 1894.76 | bwd_inner_microstep: 1889.29 | bwd_allreduce_microstep: 5.29 | step_microstep: 60.83 [2024-06-18 23:51:47,966] [INFO] [logging.py:96:log_dist] [Rank 0] 
time (ms) | fwd: 6387.55 | bwd: 3525.06 | bwd_inner: 3514.82 | bwd_allreduce: 10.00 | step: 60.91 88%|████████▊ | 531/600 [1:36:41<12:17, 10.69s/it] {'loss': 0.4527, 'learning_rate': 3.4281953094578877e-06, 'epoch': 5.31} 88%|████████▊ | 531/600 [1:36:41<12:17, 10.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0238, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1025, device='cuda:0', grad_fn=) [2024-06-18 23:51:53,260] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.05 | bwd_microstep: 1726.58 | bwd_inner_microstep: 1721.57 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3100, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.3493, device='cuda:0', grad_fn=) [2024-06-18 23:51:58,859] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:51:58,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.09 | bwd_microstep: 1894.74 | bwd_inner_microstep: 1889.27 | bwd_allreduce_microstep: 5.29 | step_microstep: 61.15 [2024-06-18 23:51:58,860] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7024.14 | bwd: 3621.31 | bwd_inner: 3610.91 | bwd_allreduce: 10.15 | step: 
61.23 89%|████████▊ | 532/600 [1:36:52<12:10, 10.75s/it] {'loss': 0.2259, 'learning_rate': 3.3306576927482126e-06, 'epoch': 5.32} 89%|████████▊ | 532/600 [1:36:52<12:10, 10.75s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5172, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5240, device='cuda:0', grad_fn=) [2024-06-18 23:52:04,458] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3560.97 | bwd_microstep: 1931.78 | bwd_inner_microstep: 1926.78 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4206, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.4373, device='cuda:0', grad_fn=) [2024-06-18 23:52:10,122] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:52:10,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.05 | bwd_microstep: 1935.14 | bwd_inner_microstep: 1928.19 | bwd_allreduce_microstep: 6.76 | step_microstep: 61.89 [2024-06-18 23:52:10,123] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7124.00 | bwd: 3866.91 | bwd_inner: 3855.07 | bwd_allreduce: 11.54 | step: 61.96 89%|████████▉ | 533/600 [1:37:03<12:10, 10.90s/it] {'loss': 0.4806, 'learning_rate': 
3.2344799053621646e-06, 'epoch': 5.33} 89%|████████▉ | 533/600 [1:37:03<12:10, 10.90s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4549, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.4686, device='cuda:0', grad_fn=) [2024-06-18 23:52:15,716] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.06 | bwd_microstep: 1927.95 | bwd_inner_microstep: 1923.02 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.2105, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2590, device='cuda:0', grad_fn=) [2024-06-18 23:52:19,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:52:19,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2214.31 | bwd_microstep: 1317.05 | bwd_inner_microstep: 1311.66 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.27 [2024-06-18 23:52:19,389] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5772.36 | bwd: 3245.00 | bwd_inner: 3234.77 | bwd_allreduce: 9.99 | step: 60.35 89%|████████▉ | 534/600 [1:37:12<11:27, 10.41s/it] {'loss': 0.3638, 'learning_rate': 3.1396647496828247e-06, 'epoch': 5.34} 89%|████████▉ | 534/600 [1:37:12<11:27, 
10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5276, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5444, device='cuda:0', grad_fn=) [2024-06-18 23:52:23,966] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2841.25 | bwd_microstep: 1641.33 | bwd_inner_microstep: 1636.48 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5670, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5691, device='cuda:0', grad_fn=) [2024-06-18 23:52:29,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:52:29,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.67 | bwd_microstep: 1909.03 | bwd_inner_microstep: 1903.27 | bwd_allreduce_microstep: 5.64 | step_microstep: 60.72 [2024-06-18 23:52:29,587] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6393.90 | bwd: 3550.36 | bwd_inner: 3539.77 | bwd_allreduce: 10.40 | step: 60.80 89%|████████▉ | 535/600 [1:37:22<11:12, 10.35s/it] {'loss': 0.5568, 'learning_rate': 3.0462149883895563e-06, 'epoch': 5.35} 89%|████████▉ | 535/600 [1:37:22<11:12, 10.35s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5359, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5411, device='cuda:0', grad_fn=) [2024-06-18 23:52:35,187] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.44 | bwd_microstep: 1934.02 | bwd_inner_microstep: 1929.22 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7000, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6885, device='cuda:0', grad_fn=) [2024-06-18 23:52:40,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:52:40,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.15 | bwd_microstep: 1922.80 | bwd_inner_microstep: 1917.37 | bwd_allreduce_microstep: 5.25 | step_microstep: 60.75 [2024-06-18 23:52:40,826] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7115.57 | bwd: 3856.82 | bwd_inner: 3846.62 | bwd_allreduce: 9.97 | step: 60.83 89%|████████▉ | 536/600 [1:37:33<11:19, 10.61s/it] {'loss': 0.6148, 'learning_rate': 2.9541333443775243e-06, 'epoch': 5.36} 89%|████████▉ | 536/600 [1:37:33<11:19, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0085, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0887, device='cuda:0', grad_fn=) [2024-06-18 23:52:45,443] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2863.28 | bwd_microstep: 1663.40 | bwd_inner_microstep: 1658.40 | bwd_allreduce_microstep: 4.89 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.0029, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0837, device='cuda:0', grad_fn=) [2024-06-18 23:52:47,409] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:52:47,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1407.49 | bwd_microstep: 454.00 | bwd_inner_microstep: 448.67 | bwd_allreduce_microstep: 5.23 | step_microstep: 61.30 [2024-06-18 23:52:47,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 4270.76 | bwd: 2117.39 | bwd_inner: 2107.08 | bwd_allreduce: 10.12 | step: 61.39 90%|████████▉ | 537/600 [1:37:40<09:52, 9.41s/it] {'loss': 0.0862, 'learning_rate': 2.8634225006782865e-06, 'epoch': 5.37} 90%|████████▉ | 537/600 [1:37:40<09:52, 9.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of 
tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4126, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4412, device='cuda:0', grad_fn=) [2024-06-18 23:52:52,934] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3538.48 | bwd_microstep: 1888.69 | bwd_inner_microstep: 1883.82 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6834, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6850, device='cuda:0', grad_fn=) [2024-06-18 23:52:58,532] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:52:58,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.90 | bwd_microstep: 1895.09 | bwd_inner_microstep: 1889.69 | bwd_allreduce_microstep: 5.23 | step_microstep: 60.71 [2024-06-18 23:52:58,533] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7086.36 | bwd: 3783.77 | bwd_inner: 3773.56 | bwd_allreduce: 9.98 | step: 60.79 90%|████████▉ | 538/600 [1:37:51<10:15, 9.92s/it] {'loss': 0.5631, 'learning_rate': 2.774085100381735e-06, 'epoch': 5.38} 90%|████████▉ | 538/600 [1:37:51<10:15, 9.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4766, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4989, device='cuda:0', grad_fn=) [2024-06-18 23:53:03,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3139.02 | bwd_microstep: 1669.20 | bwd_inner_microstep: 1664.15 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.9313, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.8973, device='cuda:0', grad_fn=) [2024-06-18 23:53:08,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:53:08,305] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2790.64 | bwd_microstep: 1907.53 | bwd_inner_microstep: 1902.15 | bwd_allreduce_microstep: 5.25 | step_microstep: 61.08 [2024-06-18 23:53:08,306] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5929.60 | bwd: 3576.73 | bwd_inner: 3566.38 | bwd_allreduce: 10.10 | step: 61.22 90%|████████▉ | 539/600 [1:38:01<10:02, 9.88s/it] {'loss': 0.6981, 'learning_rate': 2.686123746558961e-06, 'epoch': 5.39} 90%|████████▉ | 539/600 [1:38:01<10:02, 9.88s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3034, device='cuda:0', grad_fn=) tensor(0.6998, 
device='cuda:0', grad_fn=) tensor(0.3431, device='cuda:0', grad_fn=) [2024-06-18 23:53:13,884] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.42 | bwd_microstep: 1922.03 | bwd_inner_microstep: 1917.18 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0006, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0816, device='cuda:0', grad_fn=) [2024-06-18 23:53:19,253] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:53:19,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3474.58 | bwd_microstep: 1743.79 | bwd_inner_microstep: 1738.36 | bwd_allreduce_microstep: 5.32 | step_microstep: 60.85 [2024-06-18 23:53:19,254] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7029.96 | bwd: 3665.81 | bwd_inner: 3655.55 | bwd_allreduce: 10.08 | step: 60.94 90%|█████████ | 540/600 [1:38:12<10:11, 10.20s/it] {'loss': 0.2124, 'learning_rate': 2.5995410021864787e-06, 'epoch': 5.4} 90%|█████████ | 540/600 [1:38:12<10:11, 10.20s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0126, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0928, device='cuda:0', grad_fn=) [2024-06-18 23:53:22,038] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1882.42 | bwd_microstep: 839.92 | bwd_inner_microstep: 835.06 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1886, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.2282, device='cuda:0', grad_fn=) [2024-06-18 23:53:27,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:53:27,638] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.88 | bwd_microstep: 1897.90 | bwd_inner_microstep: 1892.42 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.65 [2024-06-18 23:53:27,639] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5427.28 | bwd: 2737.81 | bwd_inner: 2727.52 | bwd_allreduce: 10.05 | step: 60.73 90%|█████████ | 541/600 [1:38:20<09:29, 9.65s/it] {'loss': 0.1605, 'learning_rate': 2.5143393900715296e-06, 'epoch': 5.41} 90%|█████████ | 541/600 [1:38:20<09:29, 9.65s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.3511, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.3859, device='cuda:0', grad_fn=) [2024-06-18 23:53:32,269] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2862.25 | bwd_microstep: 
1676.33 | bwd_inner_microstep: 1671.26 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) tensor(0.6229, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6306, device='cuda:0', grad_fn=) [2024-06-18 23:53:37,077] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:53:37,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2991.66 | bwd_microstep: 1661.71 | bwd_inner_microstep: 1656.25 | bwd_allreduce_microstep: 5.28 | step_microstep: 60.71 [2024-06-18 23:53:37,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5853.89 | bwd: 3338.03 | bwd_inner: 3327.57 | bwd_allreduce: 10.19 | step: 60.81 90%|█████████ | 542/600 [1:38:30<09:16, 9.59s/it] {'loss': 0.5083, 'learning_rate': 2.430521392778573e-06, 'epoch': 5.42} 90%|█████████ | 542/600 [1:38:30<09:16, 9.59s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) tensor(0.0181, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0974, device='cuda:0', grad_fn=) [2024-06-18 23:53:42,128] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3186.11 | bwd_microstep: 1770.45 | bwd_inner_microstep: 1765.46 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.07 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7572, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7515, device='cuda:0', grad_fn=) [2024-06-18 23:53:47,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-18 23:53:47,831] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3578.76 | bwd_microstep: 1961.60 | bwd_inner_microstep: 1956.00 | bwd_allreduce_microstep: 5.41 | step_microstep: 61.11 [2024-06-18 23:53:47,832] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6764.85 | bwd: 3732.04 | bwd_inner: 3721.55 | bwd_allreduce: 10.19 | step: 61.19 90%|█████████ | 543/600 [1:38:40<09:26, 9.94s/it] {'loss': 0.4244, 'learning_rate': 2.3480894525569562e-06, 'epoch': 5.43} 90%|█████████ | 543/600 [1:38:40<09:26, 9.94s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4049, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4339, device='cuda:0', grad_fn=) [2024-06-18 23:53:53,395] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.33 | bwd_microstep: 1912.43 | bwd_inner_microstep: 1907.55 | bwd_allreduce_microstep: 4.70 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0033, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0841, device='cuda:0', grad_fn=) [2024-06-18 23:53:58,849] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:53:58,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.99 | bwd_microstep: 1806.61 | bwd_inner_microstep: 1801.07 | bwd_allreduce_microstep: 5.36 | step_microstep: 60.70 [2024-06-18 23:53:58,850] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7046.28 | bwd: 3719.03 | bwd_inner: 3708.71 | bwd_allreduce: 10.04 | step: 60.78 91%|█████████ | 544/600 [1:38:52<09:34, 10.26s/it] {'loss': 0.259, 'learning_rate': 2.2670459712697377e-06, 'epoch': 5.44} 91%|█████████ | 544/600 [1:38:52<09:34, 10.26s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0811, device='cuda:0', grad_fn=) [2024-06-18 23:54:04,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3474.00 | bwd_microstep: 1739.22 | bwd_inner_microstep: 1734.27 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0149, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0949, device='cuda:0', grad_fn=) [2024-06-18 23:54:09,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:54:09,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.42 | bwd_microstep: 1807.30 | bwd_inner_microstep: 1801.75 | bwd_allreduce_microstep: 5.37 | step_microstep: 60.80 [2024-06-18 23:54:09,607] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6967.40 | bwd: 3546.52 | bwd_inner: 3536.11 | bwd_allreduce: 10.14 | step: 60.89 91%|█████████ | 545/600 [1:39:02<09:32, 10.41s/it] {'loss': 0.088, 'learning_rate': 2.187393310323721e-06, 'epoch': 5.45} 91%|█████████ | 545/600 [1:39:02<09:32, 10.41s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0091, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0893, device='cuda:0', grad_fn=) [2024-06-18 23:54:14,746] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3411.63 | bwd_microstep: 1638.43 | bwd_inner_microstep: 1633.48 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size 
of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2310, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.2783, device='cuda:0', grad_fn=) [2024-06-18 23:54:20,324] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:54:20,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3541.81 | bwd_microstep: 1882.51 | bwd_inner_microstep: 1877.23 | bwd_allreduce_microstep: 5.21 | step_microstep: 61.01 [2024-06-18 23:54:20,325] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6953.42 | bwd: 3520.94 | bwd_inner: 3510.74 | bwd_allreduce: 9.99 | step: 61.09 91%|█████████ | 546/600 [1:39:13<09:27, 10.50s/it] {'loss': 0.1838, 'learning_rate': 2.1091337906006482e-06, 'epoch': 5.46} 91%|█████████ | 546/600 [1:39:13<09:27, 10.50s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0028, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0836, device='cuda:0', grad_fn=) [2024-06-18 23:54:25,720] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3499.53 | bwd_microstep: 1802.27 | bwd_inner_microstep: 1797.25 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(1.0673, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(1.0306, device='cuda:0', grad_fn=) [2024-06-18 23:54:31,470] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.94 [2024-06-18 23:54:31,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.49 | bwd_microstep: 1994.26 | bwd_inner_microstep: 1988.78 | bwd_allreduce_microstep: 5.37 | step_microstep: 61.55 [2024-06-18 23:54:31,471] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7088.00 | bwd: 3796.52 | bwd_inner: 3786.05 | bwd_allreduce: 10.28 | step: 61.64 91%|█████████ | 547/600 [1:39:24<09:26, 10.70s/it] {'loss': 0.5571, 'learning_rate': 2.0322696923895434e-06, 'epoch': 5.47} 91%|█████████ | 547/600 [1:39:24<09:26, 10.70s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0101, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0906, device='cuda:0', grad_fn=) [2024-06-18 23:54:36,777] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3472.02 | bwd_microstep: 1741.99 | bwd_inner_microstep: 1737.06 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5384, device='cuda:0', grad_fn=) 
tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5541, device='cuda:0', grad_fn=) [2024-06-18 23:54:42,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:54:42,410] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.62 | bwd_microstep: 1920.72 | bwd_inner_microstep: 1915.41 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.66 [2024-06-18 23:54:42,411] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7026.62 | bwd: 3662.71 | bwd_inner: 3652.50 | bwd_allreduce: 9.98 | step: 60.74 91%|█████████▏| 548/600 [1:39:35<09:19, 10.77s/it] {'loss': 0.3224, 'learning_rate': 1.956803255320322e-06, 'epoch': 5.48} 91%|█████████▏| 548/600 [1:39:35<09:19, 10.77s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6406, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.6469, device='cuda:0', grad_fn=) [2024-06-18 23:54:47,955] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.29 | bwd_microstep: 1896.08 | bwd_inner_microstep: 1891.07 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7847, device='cuda:0', grad_fn=) [2024-06-18 
23:54:53,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:54:53,223] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3241.73 | bwd_microstep: 1867.50 | bwd_inner_microstep: 1862.18 | bwd_allreduce_microstep: 5.21 | step_microstep: 61.00 [2024-06-18 23:54:53,224] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6789.00 | bwd: 3763.57 | bwd_inner: 3753.30 | bwd_allreduce: 10.01 | step: 61.08 92%|█████████▏| 549/600 [1:39:46<09:09, 10.78s/it] {'loss': 0.7158, 'learning_rate': 1.8827366782984913e-06, 'epoch': 5.49} 92%|█████████▏| 549/600 [1:39:46<09:09, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6878, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6889, device='cuda:0', grad_fn=) [2024-06-18 23:54:57,666] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2679.18 | bwd_microstep: 1664.40 | bwd_inner_microstep: 1659.54 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1993, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.2497, device='cuda:0', grad_fn=) [2024-06-18 23:55:03,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 
[2024-06-18 23:55:03,246] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3539.91 | bwd_microstep: 1884.95 | bwd_inner_microstep: 1879.53 | bwd_allreduce_microstep: 5.31 | step_microstep: 60.63 [2024-06-18 23:55:03,247] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6219.08 | bwd: 3549.35 | bwd_inner: 3539.09 | bwd_allreduce: 10.07 | step: 60.71 92%|█████████▏| 550/600 [1:39:56<08:47, 10.55s/it] {'loss': 0.4693, 'learning_rate': 1.810072119441103e-06, 'epoch': 5.5} 92%|█████████▏| 550/600 [1:39:56<08:47, 10.55s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3213, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3591, device='cuda:0', grad_fn=) [2024-06-18 23:55:08,810] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.24 | bwd_microstep: 1913.75 | bwd_inner_microstep: 1908.79 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4867, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.4965, device='cuda:0', grad_fn=) [2024-06-18 23:55:14,540] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 2.01 [2024-06-18 23:55:14,541] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 
3583.67 | bwd_microstep: 1978.45 | bwd_inner_microstep: 1973.01 | bwd_allreduce_microstep: 5.26 | step_microstep: 63.75 [2024-06-18 23:55:14,542] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7131.88 | bwd: 3892.19 | bwd_inner: 3881.91 | bwd_allreduce: 10.03 | step: 63.83 92%|█████████▏| 551/600 [1:40:07<08:48, 10.78s/it] {'loss': 0.4278, 'learning_rate': 1.73881169601387e-06, 'epoch': 5.51} 92%|█████████▏| 551/600 [1:40:07<08:48, 10.78s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0016, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0821, device='cuda:0', grad_fn=) [2024-06-18 23:55:19,928] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3492.35 | bwd_microstep: 1800.42 | bwd_inner_microstep: 1795.52 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0003, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0818, device='cuda:0', grad_fn=) [2024-06-18 23:55:25,372] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:55:25,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.24 | bwd_microstep: 1802.94 | bwd_inner_microstep: 1797.36 | bwd_allreduce_microstep: 5.40 
| step_microstep: 60.89 [2024-06-18 23:55:25,373] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6985.55 | bwd: 3603.35 | bwd_inner: 3592.94 | bwd_allreduce: 10.17 | step: 60.97 92%|█████████▏| 552/600 [1:40:18<08:38, 10.79s/it] {'loss': 0.0819, 'learning_rate': 1.6689574843694433e-06, 'epoch': 5.52} 92%|█████████▏| 552/600 [1:40:18<08:38, 10.79s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.0010, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0820, device='cuda:0', grad_fn=) [2024-06-18 23:55:28,196] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1894.80 | bwd_microstep: 862.41 | bwd_inner_microstep: 857.53 | bwd_allreduce_microstep: 4.70 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7629, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7454, device='cuda:0', grad_fn=) [2024-06-18 23:55:32,901] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:55:32,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2744.73 | bwd_microstep: 1802.92 | bwd_inner_microstep: 1797.48 | bwd_allreduce_microstep: 5.26 | step_microstep: 61.01 [2024-06-18 23:55:32,902] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) 
| fwd: 4639.51 | bwd: 2665.32 | bwd_inner: 2655.11 | bwd_allreduce: 9.92 | step: 61.09 92%|█████████▏| 553/600 [1:40:26<07:41, 9.81s/it] {'loss': 0.4137, 'learning_rate': 1.6005115198869603e-06, 'epoch': 5.53} 92%|█████████▏| 553/600 [1:40:26<07:41, 9.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4080, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.4375, device='cuda:0', grad_fn=) [2024-06-18 23:55:38,441] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.40 | bwd_microstep: 1892.71 | bwd_inner_microstep: 1887.86 | bwd_allreduce_microstep: 4.74 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6088, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6067, device='cuda:0', grad_fn=) [2024-06-18 23:55:44,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:55:44,156] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.38 | bwd_microstep: 1969.79 | bwd_inner_microstep: 1964.36 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.79 [2024-06-18 23:55:44,157] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.76 | bwd: 3862.49 | bwd_inner: 3852.28 | bwd_allreduce: 9.98 | step: 60.87 
92%|█████████▏| 554/600 [1:40:37<07:51, 10.25s/it] {'loss': 0.5221, 'learning_rate': 1.53347579691272e-06, 'epoch': 5.54} 92%|█████████▏| 554/600 [1:40:37<07:51, 10.25s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4685, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4916, device='cuda:0', grad_fn=) [2024-06-18 23:55:49,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.04 | bwd_microstep: 1925.04 | bwd_inner_microstep: 1920.19 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5176, device='cuda:0', grad_fn=) tensor(0.7039, device='cuda:0', grad_fn=) tensor(0.5362, device='cuda:0', grad_fn=) [2024-06-18 23:55:55,379] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:55:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.04 | bwd_microstep: 1923.48 | bwd_inner_microstep: 1918.07 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.91 [2024-06-18 23:55:55,380] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7112.03 | bwd: 3848.51 | bwd_inner: 3838.32 | bwd_allreduce: 9.95 | step: 60.99 92%|█████████▎| 555/600 [1:40:48<07:54, 10.54s/it] {'loss': 0.5139, 'learning_rate': 
1.4678522687020413e-06, 'epoch': 5.55} 92%|█████████▎| 555/600 [1:40:48<07:54, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4129, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4416, device='cuda:0', grad_fn=) [2024-06-18 23:56:00,914] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3543.91 | bwd_microstep: 1891.42 | bwd_inner_microstep: 1886.58 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) [2024-06-18 23:56:05,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:56:05,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2758.37 | bwd_microstep: 1835.00 | bwd_inner_microstep: 1829.39 | bwd_allreduce_microstep: 5.49 | step_microstep: 61.61 [2024-06-18 23:56:05,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6302.26 | bwd: 3726.41 | bwd_inner: 3715.99 | bwd_allreduce: 10.23 | step: 61.69 93%|█████████▎| 556/600 [1:40:58<07:40, 10.46s/it] {'loss': 0.5131, 'learning_rate': 1.4036428473624019e-06, 'epoch': 5.56} 93%|█████████▎| 556/600 [1:40:58<07:40, 
10.46s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3138, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3524, device='cuda:0', grad_fn=) [2024-06-18 23:56:11,312] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3575.61 | bwd_microstep: 1962.55 | bwd_inner_microstep: 1957.78 | bwd_allreduce_microstep: 4.69 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4827, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5040, device='cuda:0', grad_fn=) [2024-06-18 23:56:16,907] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:56:16,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.45 | bwd_microstep: 1892.09 | bwd_inner_microstep: 1886.54 | bwd_allreduce_microstep: 5.36 | step_microstep: 60.97 [2024-06-18 23:56:16,908] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7124.02 | bwd: 3854.63 | bwd_inner: 3844.36 | bwd_allreduce: 10.04 | step: 61.04 93%|█████████▎| 557/600 [1:41:10<07:39, 10.70s/it] {'loss': 0.4282, 'learning_rate': 1.3408494037976894e-06, 'epoch': 5.57} 93%|█████████▎| 557/600 [1:41:10<07:39, 10.70s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at 
non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5618, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5760, device='cuda:0', grad_fn=) [2024-06-18 23:56:22,438] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3542.43 | bwd_microstep: 1889.22 | bwd_inner_microstep: 1884.40 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7421, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7263, device='cuda:0', grad_fn=) [2024-06-18 23:56:28,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:56:28,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.46 | bwd_microstep: 1971.17 | bwd_inner_microstep: 1965.63 | bwd_allreduce_microstep: 5.36 | step_microstep: 60.89 [2024-06-18 23:56:28,154] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7122.85 | bwd: 3860.39 | bwd_inner: 3850.07 | bwd_allreduce: 10.06 | step: 60.97 93%|█████████▎| 558/600 [1:41:21<07:36, 10.86s/it] {'loss': 0.6512, 'learning_rate': 1.2794737676536994e-06, 'epoch': 5.58} 93%|█████████▎| 558/600 [1:41:21<07:36, 10.86s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5961, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5953, device='cuda:0', grad_fn=) [2024-06-18 23:56:33,803] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3574.07 | bwd_microstep: 1966.24 | bwd_inner_microstep: 1961.28 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6479, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.6531, device='cuda:0', grad_fn=) [2024-06-18 23:56:39,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:56:39,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.11 | bwd_microstep: 1913.84 | bwd_inner_microstep: 1908.40 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.81 [2024-06-18 23:56:39,426] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7125.16 | bwd: 3880.07 | bwd_inner: 3869.76 | bwd_allreduce: 10.07 | step: 60.89 93%|█████████▎| 559/600 [1:41:32<07:30, 10.98s/it] {'loss': 0.6242, 'learning_rate': 1.2195177272648127e-06, 'epoch': 5.59} 93%|█████████▎| 559/600 [1:41:32<07:30, 10.98s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the 
size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7685, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7612, device='cuda:0', grad_fn=) [2024-06-18 23:56:45,059] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.71 | bwd_microstep: 1956.43 | bwd_inner_microstep: 1951.34 | bwd_allreduce_microstep: 4.98 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6520, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6453, device='cuda:0', grad_fn=) [2024-06-18 23:56:50,772] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:56:50,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.43 | bwd_microstep: 1968.29 | bwd_inner_microstep: 1962.94 | bwd_allreduce_microstep: 5.25 | step_microstep: 60.91 [2024-06-18 23:56:50,773] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7155.12 | bwd: 3924.71 | bwd_inner: 3914.29 | bwd_allreduce: 10.24 | step: 60.99 93%|█████████▎| 560/600 [1:41:43<07:23, 11.09s/it] {'loss': 0.7032, 'learning_rate': 1.1609830296019143e-06, 'epoch': 5.6} 93%|█████████▎| 560/600 [1:41:43<07:23, 11.09s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5748, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5876, device='cuda:0', grad_fn=) [2024-06-18 23:56:55,425] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2746.20 | bwd_microstep: 1807.64 | bwd_inner_microstep: 1802.83 | bwd_allreduce_microstep: 4.71 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0004, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0810, device='cuda:0', grad_fn=) [2024-06-18 23:57:00,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:57:00,878] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.54 | bwd_microstep: 1807.41 | bwd_inner_microstep: 1801.87 | bwd_allreduce_microstep: 5.45 | step_microstep: 60.87 [2024-06-18 23:57:00,879] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6241.70 | bwd: 3615.05 | bwd_inner: 3604.69 | bwd_allreduce: 10.17 | step: 60.94 94%|█████████▎| 561/600 [1:41:54<07:01, 10.80s/it] {'loss': 0.3343, 'learning_rate': 1.1038713802214717e-06, 'epoch': 5.61} 94%|█████████▎| 561/600 [1:41:54<07:01, 10.80s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.0009, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0814, device='cuda:0', grad_fn=) [2024-06-18 23:57:06,181] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3472.93 | bwd_microstep: 1738.19 | bwd_inner_microstep: 1733.27 | bwd_allreduce_microstep: 4.76 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2789, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.3095, device='cuda:0', grad_fn=) [2024-06-18 23:57:11,904] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:57:11,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3580.86 | bwd_microstep: 1974.67 | bwd_inner_microstep: 1969.29 | bwd_allreduce_microstep: 5.28 | step_microstep: 60.62 [2024-06-18 23:57:11,905] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7053.75 | bwd: 3712.86 | bwd_inner: 3702.61 | bwd_allreduce: 10.04 | step: 60.70 94%|█████████▎| 562/600 [1:42:05<06:52, 10.87s/it] {'loss': 0.1955, 'learning_rate': 1.0481844432158161e-06, 'epoch': 5.62} 94%|█████████▎| 562/600 [1:42:05<06:52, 10.87s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0062, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) 
tensor(0.0866, device='cuda:0', grad_fn=) [2024-06-18 23:57:17,129] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3442.15 | bwd_microstep: 1691.91 | bwd_inner_microstep: 1686.85 | bwd_allreduce_microstep: 4.88 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0347, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1123, device='cuda:0', grad_fn=) [2024-06-18 23:57:21,694] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:57:21,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2698.43 | bwd_microstep: 1718.06 | bwd_inner_microstep: 1712.69 | bwd_allreduce_microstep: 5.25 | step_microstep: 61.04 [2024-06-18 23:57:21,695] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6140.53 | bwd: 3409.96 | bwd_inner: 3399.60 | bwd_allreduce: 10.12 | step: 61.17 94%|█████████▍| 563/600 [1:42:14<06:30, 10.54s/it] {'loss': 0.0995, 'learning_rate': 9.939238411647235e-07, 'epoch': 5.63} 94%|█████████▍| 563/600 [1:42:14<06:30, 10.54s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0815, device='cuda:0', grad_fn=) [2024-06-18 23:57:27,086] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3494.95 | bwd_microstep: 1803.76 | bwd_inner_microstep: 1798.88 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3027, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.3313, device='cuda:0', grad_fn=) [2024-06-18 23:57:32,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:57:32,738] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3559.49 | bwd_microstep: 1931.53 | bwd_inner_microstep: 1926.06 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.70 [2024-06-18 23:57:32,739] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7054.41 | bwd: 3735.29 | bwd_inner: 3724.99 | bwd_allreduce: 10.06 | step: 60.79 94%|█████████▍| 564/600 [1:42:25<06:24, 10.69s/it] {'loss': 0.2064, 'learning_rate': 9.410911550880475e-07, 'epoch': 5.64} 94%|█████████▍| 564/600 [1:42:25<06:24, 10.69s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0444, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1210, device='cuda:0', grad_fn=) [2024-06-18 23:57:38,046] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3475.68 | bwd_microstep: 1740.09 
| bwd_inner_microstep: 1735.28 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4914, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.5014, device='cuda:0', grad_fn=) [2024-06-18 23:57:43,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-18 23:57:43,669] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.57 | bwd_microstep: 1904.14 | bwd_inner_microstep: 1898.49 | bwd_allreduce_microstep: 5.53 | step_microstep: 62.95 [2024-06-18 23:57:43,670] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7027.21 | bwd: 3644.22 | bwd_inner: 3633.78 | bwd_allreduce: 10.27 | step: 63.03 94%|█████████▍| 565/600 [1:42:36<06:16, 10.76s/it] {'loss': 0.3112, 'learning_rate': 8.896879243997347e-07, 'epoch': 5.65} 94%|█████████▍| 565/600 [1:42:36<06:16, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5451, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.5605, device='cuda:0', grad_fn=) [2024-06-18 23:57:49,242] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.73 | bwd_microstep: 1917.19 | bwd_inner_microstep: 1912.33 | bwd_allreduce_microstep: 4.67 | step_microstep: 0.07 warning: 
The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2398, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2854, device='cuda:0', grad_fn=) [2024-06-18 23:57:54,863] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:57:54,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3550.76 | bwd_microstep: 1914.07 | bwd_inner_microstep: 1908.58 | bwd_allreduce_microstep: 5.31 | step_microstep: 60.94 [2024-06-18 23:57:54,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7104.47 | bwd: 3831.25 | bwd_inner: 3821.01 | bwd_allreduce: 9.96 | step: 61.02 94%|█████████▍| 566/600 [1:42:48<06:10, 10.89s/it] {'loss': 0.4229, 'learning_rate': 8.397156468629208e-07, 'epoch': 5.66} 94%|█████████▍| 566/600 [1:42:48<06:10, 10.89s/it]warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0021, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0830, device='cuda:0', grad_fn=) [2024-06-18 23:57:59,749] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3135.94 | bwd_microstep: 1659.30 | bwd_inner_microstep: 1654.41 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0005, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0815, device='cuda:0', grad_fn=) [2024-06-18 23:58:05,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-18 23:58:05,210] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3496.69 | bwd_microstep: 1811.17 | bwd_inner_microstep: 1805.65 | bwd_allreduce_microstep: 5.42 | step_microstep: 61.81 [2024-06-18 23:58:05,211] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6632.58 | bwd: 3470.46 | bwd_inner: 3460.08 | bwd_allreduce: 10.20 | step: 61.89 94%|█████████▍| 567/600 [1:42:58<05:54, 10.73s/it] {'loss': 0.0823, 'learning_rate': 7.911757785462881e-07, 'epoch': 5.67} 94%|█████████▍| 567/600 [1:42:58<05:54, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7050, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7041, device='cuda:0', grad_fn=) [2024-06-18 23:58:10,727] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3539.45 | bwd_microstep: 1879.53 | bwd_inner_microstep: 1874.71 | bwd_allreduce_microstep: 4.72 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4595, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4834, device='cuda:0', grad_fn=) [2024-06-18 23:58:16,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-18 23:58:16,337] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.03 | bwd_microstep: 1906.74 | bwd_inner_microstep: 1901.27 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.95 [2024-06-18 23:58:16,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7087.46 | bwd: 3786.27 | bwd_inner: 3776.03 | bwd_allreduce: 10.01 | step: 61.03 95%|█████████▍| 568/600 [1:43:09<05:47, 10.85s/it] {'loss': 0.5938, 'learning_rate': 7.44069733781677e-07, 'epoch': 5.68} 95%|█████████▍| 568/600 [1:43:09<05:47, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.8000, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.7903, device='cuda:0', grad_fn=) [2024-06-18 23:58:18,388] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 1448.65 | bwd_microstep: 543.62 | bwd_inner_microstep: 538.61 | bwd_allreduce_microstep: 4.83 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 
0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6241, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.6208, device='cuda:0', grad_fn=) [2024-06-18 23:58:24,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-18 23:58:24,094] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3573.43 | bwd_microstep: 1968.69 | bwd_inner_microstep: 1963.36 | bwd_allreduce_microstep: 5.22 | step_microstep: 61.11 [2024-06-18 23:58:24,095] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5022.06 | bwd: 2512.30 | bwd_inner: 2502.03 | bwd_allreduce: 10.04 | step: 61.20 95%|█████████▍| 569/600 [1:43:17<05:07, 9.92s/it] {'loss': 0.7056, 'learning_rate': 6.983988851228473e-07, 'epoch': 5.69} 95%|█████████▍| 569/600 [1:43:17<05:07, 9.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0269, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1053, device='cuda:0', grad_fn=) [2024-06-18 23:58:29,487] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.09 | bwd_microstep: 1803.83 | bwd_inner_microstep: 1798.90 | bwd_allreduce_microstep: 4.77 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
tensor(0.0046, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0856, device='cuda:0', grad_fn=) [2024-06-18 23:58:34,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:58:34,935] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.93 | bwd_microstep: 1805.96 | bwd_inner_microstep: 1800.65 | bwd_allreduce_microstep: 5.19 | step_microstep: 60.81 [2024-06-18 23:58:34,936] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6988.97 | bwd: 3609.78 | bwd_inner: 3599.61 | bwd_allreduce: 9.97 | step: 60.89 95%|█████████▌| 570/600 [1:43:28<05:05, 10.20s/it] {'loss': 0.0954, 'learning_rate': 6.54164563305465e-07, 'epoch': 5.7} 95%|█████████▌| 570/600 [1:43:28<05:05, 10.20s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8863, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.8672, device='cuda:0', grad_fn=) [2024-06-18 23:58:40,617] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.40 | bwd_microstep: 1985.71 | bwd_inner_microstep: 1980.66 | bwd_allreduce_microstep: 4.87 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6262, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) 
tensor(0.6224, device='cuda:0', grad_fn=) [2024-06-18 23:58:46,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.92 [2024-06-18 23:58:46,137] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3513.73 | bwd_microstep: 1846.26 | bwd_inner_microstep: 1840.86 | bwd_allreduce_microstep: 5.30 | step_microstep: 61.46 [2024-06-18 23:58:46,138] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7102.09 | bwd: 3831.96 | bwd_inner: 3821.57 | bwd_allreduce: 10.17 | step: 61.54 95%|█████████▌| 571/600 [1:43:39<05:04, 10.50s/it] {'loss': 0.7448, 'learning_rate': 6.113680572083946e-07, 'epoch': 5.71} 95%|█████████▌| 571/600 [1:43:39<05:04, 10.50s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4854, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5072, device='cuda:0', grad_fn=) [2024-06-18 23:58:50,788] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2745.86 | bwd_microstep: 1805.12 | bwd_inner_microstep: 1800.17 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1280) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1280, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6347, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6300, device='cuda:0', grad_fn=) [2024-06-18 23:58:56,090] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-18 23:58:56,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3247.83 | bwd_microstep: 1892.69 | bwd_inner_microstep: 1887.32 | bwd_allreduce_microstep: 5.27 | step_microstep: 61.01 [2024-06-18 23:58:56,091] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5993.66 | bwd: 3697.80 | bwd_inner: 3687.54 | bwd_allreduce: 10.03 | step: 61.09 95%|█████████▌| 572/600 [1:43:49<04:49, 10.33s/it] {'loss': 0.5686, 'learning_rate': 5.700106138160688e-07, 'epoch': 5.72} 95%|█████████▌| 572/600 [1:43:49<04:49, 10.33s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0074, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0881, device='cuda:0', grad_fn=) [2024-06-18 23:59:01,226] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3407.59 | bwd_microstep: 1638.03 | bwd_inner_microstep: 1633.00 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3545, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.3893, device='cuda:0', grad_fn=) [2024-06-18 23:59:06,821] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:59:06,821] 
[INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3548.84 | bwd_microstep: 1890.90 | bwd_inner_microstep: 1885.54 | bwd_allreduce_microstep: 5.26 | step_microstep: 60.50 [2024-06-18 23:59:06,822] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6956.41 | bwd: 3528.93 | bwd_inner: 3518.58 | bwd_allreduce: 10.10 | step: 60.58 96%|█████████▌| 573/600 [1:43:59<04:42, 10.45s/it] {'loss': 0.2387, 'learning_rate': 5.300934381821998e-07, 'epoch': 5.73} 96%|█████████▌| 573/600 [1:43:59<04:42, 10.45s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0029, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0841, device='cuda:0', grad_fn=) [2024-06-18 23:59:12,205] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3491.86 | bwd_microstep: 1799.38 | bwd_inner_microstep: 1794.42 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5474, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5514, device='cuda:0', grad_fn=) [2024-06-18 23:59:17,864] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-18 23:59:17,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3562.81 | bwd_microstep: 
1933.30 | bwd_inner_microstep: 1927.82 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.73 [2024-06-18 23:59:17,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7054.65 | bwd: 3732.67 | bwd_inner: 3722.32 | bwd_allreduce: 10.04 | step: 60.81 96%|█████████▌| 574/600 [1:44:11<04:36, 10.63s/it] {'loss': 0.3178, 'learning_rate': 4.916176933946693e-07, 'epoch': 5.74} 96%|█████████▌| 574/600 [1:44:11<04:36, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.1910, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.2422, device='cuda:0', grad_fn=) [2024-06-18 23:59:23,446] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.07 | bwd_microstep: 1921.80 | bwd_inner_microstep: 1916.87 | bwd_allreduce_microstep: 4.78 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6382, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6332, device='cuda:0', grad_fn=) [2024-06-18 23:59:29,096] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-18 23:59:29,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.55 | bwd_microstep: 1926.39 | bwd_inner_microstep: 1920.82 | bwd_allreduce_microstep: 5.46 | step_microstep: 61.39 
[2024-06-18 23:59:29,097] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7117.60 | bwd: 3848.19 | bwd_inner: 3837.74 | bwd_allreduce: 10.25 | step: 61.47 96%|█████████▌| 575/600 [1:44:22<04:30, 10.81s/it] {'loss': 0.4377, 'learning_rate': 4.545845005415994e-07, 'epoch': 5.75} 96%|█████████▌| 575/600 [1:44:22<04:30, 10.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0237, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.1020, device='cuda:0', grad_fn=) [2024-06-18 23:59:34,316] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3440.62 | bwd_microstep: 1689.93 | bwd_inner_microstep: 1684.97 | bwd_allreduce_microstep: 4.79 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5900, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.6013, device='cuda:0', grad_fn=) [2024-06-18 23:59:39,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-18 23:59:39,941] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.18 | bwd_microstep: 1916.17 | bwd_inner_microstep: 1910.63 | bwd_allreduce_microstep: 5.36 | step_microstep: 61.04 [2024-06-18 23:59:39,942] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6991.79 | 
bwd: 3606.10 | bwd_inner: 3595.69 | bwd_allreduce: 10.10 | step: 61.12 96%|█████████▌| 576/600 [1:44:33<04:19, 10.82s/it] {'loss': 0.3517, 'learning_rate': 4.189949386787462e-07, 'epoch': 5.76} 96%|█████████▌| 576/600 [1:44:33<04:19, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7676, device='cuda:0', grad_fn=) tensor(0.5913, device='cuda:0', grad_fn=) tensor(0.7500, device='cuda:0', grad_fn=) [2024-06-18 23:59:44,830] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2945.13 | bwd_microstep: 1838.72 | bwd_inner_microstep: 1833.87 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6600, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.6525, device='cuda:0', grad_fn=) [2024-06-18 23:59:50,484] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-18 23:59:50,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3556.53 | bwd_microstep: 1934.47 | bwd_inner_microstep: 1928.88 | bwd_allreduce_microstep: 5.41 | step_microstep: 60.83 [2024-06-18 23:59:50,485] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6501.64 | bwd: 3773.18 | bwd_inner: 3762.80 | bwd_allreduce: 10.14 | step: 60.90 96%|█████████▌| 577/600 
[1:44:43<04:06, 10.74s/it] {'loss': 0.7012, 'learning_rate': 3.848500447979908e-07, 'epoch': 5.77} 96%|█████████▌| 577/600 [1:44:43<04:06, 10.74s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5621, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.5762, device='cuda:0', grad_fn=) [2024-06-18 23:59:56,063] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3553.37 | bwd_microstep: 1921.55 | bwd_inner_microstep: 1916.69 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0111, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0911, device='cuda:0', grad_fn=) [2024-06-19 00:00:01,510] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-19 00:00:01,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.08 | bwd_microstep: 1804.80 | bwd_inner_microstep: 1799.40 | bwd_allreduce_microstep: 5.29 | step_microstep: 60.83 [2024-06-19 00:00:01,511] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7048.43 | bwd: 3726.34 | bwd_inner: 3716.10 | bwd_allreduce: 10.05 | step: 60.91 96%|█████████▋| 578/600 [1:44:54<03:58, 10.82s/it] {'loss': 0.3336, 'learning_rate': 3.5215081379718074e-07, 'epoch': 
5.78} 96%|█████████▋| 578/600 [1:44:54<03:58, 10.82s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5717, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5845, device='cuda:0', grad_fn=) [2024-06-19 00:00:07,078] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3549.69 | bwd_microstep: 1915.25 | bwd_inner_microstep: 1910.51 | bwd_allreduce_microstep: 4.67 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4504, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4752, device='cuda:0', grad_fn=) [2024-06-19 00:00:12,678] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.83 [2024-06-19 00:00:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.57 | bwd_microstep: 1896.60 | bwd_inner_microstep: 1891.30 | bwd_allreduce_microstep: 5.23 | step_microstep: 60.70 [2024-06-19 00:00:12,679] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7097.24 | bwd: 3811.84 | bwd_inner: 3801.79 | bwd_allreduce: 9.92 | step: 60.78 96%|█████████▋| 579/600 [1:45:05<03:49, 10.93s/it] {'loss': 0.5298, 'learning_rate': 3.208981984511195e-07, 'epoch': 5.79} 96%|█████████▋| 579/600 [1:45:05<03:49, 10.93s/it]warning: The size of tensor a (0) must 
match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2685, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.3116, device='cuda:0', grad_fn=) [2024-06-19 00:00:18,201] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3538.44 | bwd_microstep: 1885.70 | bwd_inner_microstep: 1880.32 | bwd_allreduce_microstep: 5.25 | step_microstep: 0.10 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(1.2648, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(1.2082, device='cuda:0', grad_fn=) [2024-06-19 00:00:22,619] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-19 00:00:22,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2651.84 | bwd_microstep: 1615.99 | bwd_inner_microstep: 1610.61 | bwd_allreduce_microstep: 5.27 | step_microstep: 61.31 [2024-06-19 00:00:22,620] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6190.26 | bwd: 3501.68 | bwd_inner: 3490.96 | bwd_allreduce: 10.52 | step: 61.41 97%|█████████▋| 580/600 [1:45:15<03:32, 10.63s/it] {'loss': 0.7599, 'learning_rate': 2.9109310938378877e-07, 'epoch': 5.8} 97%|█████████▋| 580/600 [1:45:15<03:32, 10.63s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, 
input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7943, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7849, device='cuda:0', grad_fn=) [2024-06-19 00:00:28,153] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3541.94 | bwd_microstep: 1891.67 | bwd_inner_microstep: 1886.68 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.5867, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5865, device='cuda:0', grad_fn=) [2024-06-19 00:00:32,973] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.90 [2024-06-19 00:00:32,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2778.64 | bwd_microstep: 1878.38 | bwd_inner_microstep: 1872.99 | bwd_allreduce_microstep: 5.28 | step_microstep: 60.84 [2024-06-19 00:00:32,974] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6320.53 | bwd: 3770.04 | bwd_inner: 3759.73 | bwd_allreduce: 10.07 | step: 60.92 97%|█████████▋| 581/600 [1:45:26<03:20, 10.55s/it] {'loss': 0.6857, 'learning_rate': 2.6273641504184766e-07, 'epoch': 5.81} 97%|█████████▋| 581/600 [1:45:26<03:20, 10.55s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0115, device='cuda:0', grad_fn=) tensor(0.8066, device='cuda:0', grad_fn=) tensor(0.0910, device='cuda:0', grad_fn=) [2024-06-19 00:00:38,369] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3493.59 | bwd_microstep: 1806.84 | bwd_inner_microstep: 1801.78 | bwd_allreduce_microstep: 4.95 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0433, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.1205, device='cuda:0', grad_fn=) [2024-06-19 00:00:43,743] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.81 [2024-06-19 00:00:43,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3476.86 | bwd_microstep: 1746.36 | bwd_inner_microstep: 1741.02 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.36 [2024-06-19 00:00:43,744] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6970.42 | bwd: 3553.19 | bwd_inner: 3542.81 | bwd_allreduce: 10.19 | step: 60.50 97%|█████████▋| 582/600 [1:45:36<03:11, 10.61s/it] {'loss': 0.1058, 'learning_rate': 2.3582894166930268e-07, 'epoch': 5.82} 97%|█████████▋| 582/600 [1:45:36<03:11, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton 
dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5252, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5312, device='cuda:0', grad_fn=) [2024-06-19 00:00:49,398] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.30 | bwd_microstep: 1966.20 | bwd_inner_microstep: 1961.23 | bwd_allreduce_microstep: 4.80 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0821, device='cuda:0', grad_fn=) tensor(0.7031, device='cuda:0', grad_fn=) tensor(0.1442, device='cuda:0', grad_fn=) [2024-06-19 00:00:55,098] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-19 00:00:55,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.87 | bwd_microstep: 1959.61 | bwd_inner_microstep: 1954.25 | bwd_allreduce_microstep: 5.24 | step_microstep: 60.83 [2024-06-19 00:00:55,099] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7161.13 | bwd: 3925.80 | bwd_inner: 3915.54 | bwd_allreduce: 10.03 | step: 60.92 97%|█████████▋| 583/600 [1:45:48<03:04, 10.84s/it] {'loss': 0.3377, 'learning_rate': 2.1037147328344387e-07, 'epoch': 5.83} 97%|█████████▋| 583/600 [1:45:48<03:04, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0021, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0830, device='cuda:0', grad_fn=) [2024-06-19 00:01:00,322] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3443.05 | bwd_microstep: 1691.23 | bwd_inner_microstep: 1686.35 | bwd_allreduce_microstep: 4.77 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3263, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.3524, device='cuda:0', grad_fn=) [2024-06-19 00:01:05,988] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-19 00:01:05,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3563.74 | bwd_microstep: 1938.13 | bwd_inner_microstep: 1932.76 | bwd_allreduce_microstep: 5.21 | step_microstep: 60.80 [2024-06-19 00:01:05,989] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7006.75 | bwd: 3629.35 | bwd_inner: 3619.16 | bwd_allreduce: 9.99 | step: 60.89 97%|█████████▋| 584/600 [1:45:59<02:53, 10.85s/it] {'loss': 0.2177, 'learning_rate': 1.8636475165200174e-07, 'epoch': 5.84} 97%|█████████▋| 584/600 [1:45:59<02:53, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0854, device='cuda:0', grad_fn=) 
tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1579, device='cuda:0', grad_fn=) [2024-06-19 00:01:11,538] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3656.47 | bwd_microstep: 1800.19 | bwd_inner_microstep: 1795.15 | bwd_allreduce_microstep: 4.90 | step_microstep: 0.08 warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2167, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2646, device='cuda:0', grad_fn=) [2024-06-19 00:01:16,117] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-19 00:01:16,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2701.78 | bwd_microstep: 1723.10 | bwd_inner_microstep: 1717.60 | bwd_allreduce_microstep: 5.32 | step_microstep: 61.14 [2024-06-19 00:01:16,118] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6358.24 | bwd: 3523.28 | bwd_inner: 3512.83 | bwd_allreduce: 10.20 | step: 61.22 98%|█████████▊| 585/600 [1:46:09<02:39, 10.64s/it] {'loss': 0.2113, 'learning_rate': 1.6380947627153143e-07, 'epoch': 5.85} 98%|█████████▊| 585/600 [1:46:09<02:39, 10.64s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5871, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.5984, device='cuda:0', grad_fn=) [2024-06-19 
00:01:21,656] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3547.77 | bwd_microstep: 1891.71 | bwd_inner_microstep: 1886.85 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8157, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.7930, device='cuda:0', grad_fn=) [2024-06-19 00:01:27,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-19 00:01:27,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3561.44 | bwd_microstep: 1937.70 | bwd_inner_microstep: 1932.30 | bwd_allreduce_microstep: 5.23 | step_microstep: 60.98 [2024-06-19 00:01:27,320] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7109.19 | bwd: 3829.40 | bwd_inner: 3819.20 | bwd_allreduce: 9.99 | step: 61.06 98%|█████████▊| 586/600 [1:46:20<02:31, 10.81s/it] {'loss': 0.6957, 'learning_rate': 1.427063043470178e-07, 'epoch': 5.86} 98%|█████████▊| 586/600 [1:46:20<02:31, 10.81s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.4209, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.4484, device='cuda:0', grad_fn=) [2024-06-19 00:01:32,011] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2756.26 | 
bwd_microstep: 1833.69 | bwd_inner_microstep: 1828.78 | bwd_allreduce_microstep: 4.81 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8605, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.8332, device='cuda:0', grad_fn=) [2024-06-19 00:01:37,759] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-19 00:01:37,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.17 | bwd_microstep: 1988.45 | bwd_inner_microstep: 1983.08 | bwd_allreduce_microstep: 5.27 | step_microstep: 61.70 [2024-06-19 00:01:37,760] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6344.40 | bwd: 3822.14 | bwd_inner: 3811.87 | bwd_allreduce: 10.08 | step: 61.79 98%|█████████▊| 587/600 [1:46:30<02:19, 10.70s/it] {'loss': 0.6408, 'learning_rate': 1.2305585077276306e-07, 'epoch': 5.87} 98%|█████████▊| 587/600 [1:46:30<02:19, 10.70s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0014, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0827, device='cuda:0', grad_fn=) [2024-06-19 00:01:43,064] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3473.93 | bwd_microstep: 1738.74 | bwd_inner_microstep: 1733.83 | bwd_allreduce_microstep: 4.81 | 
step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5719, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.5843, device='cuda:0', grad_fn=) [2024-06-19 00:01:48,671] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-19 00:01:48,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3552.76 | bwd_microstep: 1896.59 | bwd_inner_microstep: 1891.08 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.00 [2024-06-19 00:01:48,672] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7026.65 | bwd: 3635.32 | bwd_inner: 3624.95 | bwd_allreduce: 10.14 | step: 61.08 98%|█████████▊| 588/600 [1:46:41<02:09, 10.76s/it] {'loss': 0.3335, 'learning_rate': 1.0485868811441757e-07, 'epoch': 5.88} 98%|█████████▊| 588/600 [1:46:41<02:09, 10.76s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7441, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7396, device='cuda:0', grad_fn=) [2024-06-19 00:01:54,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3579.40 | bwd_microstep: 1956.09 | bwd_inner_microstep: 1951.29 | bwd_allreduce_microstep: 4.69 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) 
at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2121, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.2608, device='cuda:0', grad_fn=) [2024-06-19 00:01:59,953] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-19 00:01:59,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3555.56 | bwd_microstep: 1926.66 | bwd_inner_microstep: 1921.00 | bwd_allreduce_microstep: 5.53 | step_microstep: 63.06 [2024-06-19 00:01:59,954] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7134.92 | bwd: 3882.74 | bwd_inner: 3872.31 | bwd_allreduce: 10.22 | step: 63.14 98%|█████████▊| 589/600 [1:46:53<02:00, 10.92s/it] {'loss': 0.5002, 'learning_rate': 8.811534659234899e-08, 'epoch': 5.89} 98%|█████████▊| 589/600 [1:46:53<02:00, 10.92s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5865, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.5863, device='cuda:0', grad_fn=) [2024-06-19 00:02:05,604] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3577.57 | bwd_microstep: 1965.26 | bwd_inner_microstep: 1960.35 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), 
vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.9391, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.9148, device='cuda:0', grad_fn=) [2024-06-19 00:02:11,340] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.91 [2024-06-19 00:02:11,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3588.01 | bwd_microstep: 1982.97 | bwd_inner_microstep: 1977.58 | bwd_allreduce_microstep: 5.27 | step_microstep: 61.00 [2024-06-19 00:02:11,341] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7165.54 | bwd: 3948.22 | bwd_inner: 3937.99 | bwd_allreduce: 10.03 | step: 61.08 98%|█████████▊| 590/600 [1:47:04<01:50, 11.06s/it] {'loss': 0.7505, 'learning_rate': 7.282631406615447e-08, 'epoch': 5.9} 98%|█████████▊| 590/600 [1:47:04<01:50, 11.06s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6206, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6285, device='cuda:0', grad_fn=) [2024-06-19 00:02:16,865] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3541.76 | bwd_microstep: 1885.58 | bwd_inner_microstep: 1880.62 | bwd_allreduce_microstep: 4.82 | step_microstep: 0.13 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size 
of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.5133, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.5208, device='cuda:0', grad_fn=) [2024-06-19 00:02:22,585] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-19 00:02:22,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3581.58 | bwd_microstep: 1972.04 | bwd_inner_microstep: 1966.55 | bwd_allreduce_microstep: 5.32 | step_microstep: 60.79 [2024-06-19 00:02:22,586] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7123.31 | bwd: 3857.62 | bwd_inner: 3847.25 | bwd_allreduce: 10.12 | step: 60.92 98%|█████████▊| 591/600 [1:47:15<01:40, 11.11s/it] {'loss': 0.5746, 'learning_rate': 5.899203602046655e-08, 'epoch': 5.91} 98%|█████████▊| 591/600 [1:47:15<01:40, 11.11s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2458, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2908, device='cuda:0', grad_fn=) [2024-06-19 00:02:28,125] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3546.47 | bwd_microstep: 1893.42 | bwd_inner_microstep: 1888.50 | bwd_allreduce_microstep: 4.74 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 
6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4866, device='cuda:0', grad_fn=) tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.4967, device='cuda:0', grad_fn=) [2024-06-19 00:02:33,732] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.82 [2024-06-19 00:02:33,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.36 | bwd_microstep: 1898.64 | bwd_inner_microstep: 1893.18 | bwd_allreduce_microstep: 5.29 | step_microstep: 60.78 [2024-06-19 00:02:33,733] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7097.79 | bwd: 3792.06 | bwd_inner: 3781.77 | bwd_allreduce: 10.01 | step: 60.86 99%|█████████▊| 592/600 [1:47:26<01:28, 11.12s/it] {'loss': 0.3938, 'learning_rate': 4.661291555196345e-08, 'epoch': 5.92} 99%|█████████▊| 592/600 [1:47:26<01:28, 11.12s/it]warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2738, device='cuda:0', grad_fn=) tensor(0.6948, device='cuda:0', grad_fn=) tensor(0.3159, device='cuda:0', grad_fn=) [2024-06-19 00:02:38,311] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2842.28 | bwd_microstep: 1640.24 | bwd_inner_microstep: 1635.27 | bwd_allreduce_microstep: 4.86 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (768) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([768, 6144]) tensor(0.6999, device='cuda:0', grad_fn=) 
tensor(0.5880, device='cuda:0', grad_fn=) tensor(0.6887, device='cuda:0', grad_fn=) [2024-06-19 00:02:43,294] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-19 00:02:43,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2954.37 | bwd_microstep: 1865.98 | bwd_inner_microstep: 1860.66 | bwd_allreduce_microstep: 5.21 | step_microstep: 60.64 [2024-06-19 00:02:43,295] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 5796.60 | bwd: 3506.22 | bwd_inner: 3495.95 | bwd_allreduce: 10.08 | step: 60.72 99%|█████████▉| 593/600 [1:47:36<01:14, 10.66s/it] {'loss': 0.5023, 'learning_rate': 3.5689313357634145e-08, 'epoch': 5.93} 99%|█████████▉| 593/600 [1:47:36<01:14, 10.66s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.4762, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4985, device='cuda:0', grad_fn=) [2024-06-19 00:02:48,825] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3544.98 | bwd_microstep: 1887.91 | bwd_inner_microstep: 1883.09 | bwd_allreduce_microstep: 4.71 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7465, device='cuda:0', grad_fn=) tensor(0.5847, device='cuda:0', grad_fn=) tensor(0.7304, device='cuda:0', grad_fn=) 
[2024-06-19 00:02:54,575] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-19 00:02:54,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3593.11 | bwd_microstep: 1988.72 | bwd_inner_microstep: 1983.17 | bwd_allreduce_microstep: 5.45 | step_microstep: 61.23 [2024-06-19 00:02:54,576] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7138.08 | bwd: 3876.63 | bwd_inner: 3866.28 | bwd_allreduce: 10.17 | step: 61.31 99%|█████████▉| 594/600 [1:47:47<01:05, 10.84s/it] {'loss': 0.6144, 'learning_rate': 2.6221547724253337e-08, 'epoch': 5.94} 99%|█████████▉| 594/600 [1:47:47<01:05, 10.84s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0539, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.1296, device='cuda:0', grad_fn=) [2024-06-19 00:02:59,799] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3443.13 | bwd_microstep: 1691.85 | bwd_inner_microstep: 1687.07 | bwd_allreduce_microstep: 4.67 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.7057, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.7051, device='cuda:0', grad_fn=) [2024-06-19 00:03:05,435] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | 
optimizer_step: 1.92 [2024-06-19 00:03:05,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3557.35 | bwd_microstep: 1918.59 | bwd_inner_microstep: 1913.18 | bwd_allreduce_microstep: 5.30 | step_microstep: 60.93 [2024-06-19 00:03:05,436] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7000.46 | bwd: 3610.43 | bwd_inner: 3600.26 | bwd_allreduce: 9.98 | step: 61.01 99%|█████████▉| 595/600 [1:47:58<00:54, 10.85s/it] {'loss': 0.4173, 'learning_rate': 1.8209894519122252e-08, 'epoch': 5.95} 99%|█████████▉| 595/600 [1:47:58<00:54, 10.85s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6470, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.6519, device='cuda:0', grad_fn=) [2024-06-19 00:03:11,003] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3554.68 | bwd_microstep: 1911.44 | bwd_inner_microstep: 1906.61 | bwd_allreduce_microstep: 4.73 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1024) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1024, 6144]) tensor(0.4763, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.4986, device='cuda:0', grad_fn=) [2024-06-19 00:03:15,890] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.89 [2024-06-19 00:03:15,890] [INFO] [logging.py:96:log_dist] [Rank 0] time 
(ms) | fwd_microstep: 3019.73 | bwd_microstep: 1710.15 | bwd_inner_microstep: 1704.61 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.14 [2024-06-19 00:03:15,891] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6574.37 | bwd: 3621.58 | bwd_inner: 3611.28 | bwd_allreduce: 10.06 | step: 61.22 99%|█████████▉| 596/600 [1:48:09<00:42, 10.73s/it] {'loss': 0.5752, 'learning_rate': 1.1654587182013953e-08, 'epoch': 5.96} 99%|█████████▉| 596/600 [1:48:09<00:42, 10.73s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0018, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0831, device='cuda:0', grad_fn=) [2024-06-19 00:03:21,112] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3441.28 | bwd_microstep: 1691.72 | bwd_inner_microstep: 1686.80 | bwd_allreduce_microstep: 4.75 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0002, device='cuda:0', grad_fn=) tensor(0.8107, device='cuda:0', grad_fn=) tensor(0.0813, device='cuda:0', grad_fn=) [2024-06-19 00:03:26,566] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.84 [2024-06-19 00:03:26,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3495.18 | bwd_microstep: 1809.21 | bwd_inner_microstep: 1803.81 | 
bwd_allreduce_microstep: 5.29 | step_microstep: 60.71 [2024-06-19 00:03:26,567] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6936.43 | bwd: 3500.93 | bwd_inner: 3490.67 | bwd_allreduce: 10.02 | step: 60.78 100%|█████████▉| 597/600 [1:48:19<00:32, 10.71s/it] {'loss': 0.0822, 'learning_rate': 6.5558167183898955e-09, 'epoch': 5.97} 100%|█████████▉| 597/600 [1:48:19<00:32, 10.71s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) tensor(0.4734, device='cuda:0', grad_fn=) tensor(0.6998, device='cuda:0', grad_fn=) tensor(0.4961, device='cuda:0', grad_fn=) [2024-06-19 00:03:31,319] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2780.22 | bwd_microstep: 1870.90 | bwd_inner_microstep: 1866.09 | bwd_allreduce_microstep: 4.71 | step_microstep: 0.07 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.6071, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.6163, device='cuda:0', grad_fn=) [2024-06-19 00:03:36,945] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.85 [2024-06-19 00:03:36,946] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3551.44 | bwd_microstep: 1917.19 | bwd_inner_microstep: 1911.75 | bwd_allreduce_microstep: 5.34 | step_microstep: 61.04 [2024-06-19 00:03:36,946] [INFO] 
[logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6331.61 | bwd: 3788.08 | bwd_inner: 3777.85 | bwd_allreduce: 10.05 | step: 61.11 100%|█████████▉| 598/600 [1:48:30<00:21, 10.61s/it] {'loss': 0.5562, 'learning_rate': 2.9137316938265825e-09, 'epoch': 5.98} 100%|█████████▉| 598/600 [1:48:30<00:21, 10.61s/it]warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.0046, device='cuda:0', grad_fn=) tensor(0.8148, device='cuda:0', grad_fn=) tensor(0.0856, device='cuda:0', grad_fn=) [2024-06-19 00:03:42,257] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3481.01 | bwd_microstep: 1738.19 | bwd_inner_microstep: 1733.14 | bwd_allreduce_microstep: 4.94 | step_microstep: 0.09 warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.2282, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.2749, device='cuda:0', grad_fn=) [2024-06-19 00:03:47,898] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.87 [2024-06-19 00:03:47,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.27 | bwd_microstep: 1923.61 | bwd_inner_microstep: 1918.08 | bwd_allreduce_microstep: 5.35 | step_microstep: 61.14 [2024-06-19 00:03:47,899] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 7039.27 | bwd: 3661.80 | bwd_inner: 
3651.27 | bwd_allreduce: 10.28 | step: 61.24 100%|█████████▉| 599/600 [1:48:41<00:10, 10.72s/it] {'loss': 0.1803, 'learning_rate': 7.284382296801617e-10, 'epoch': 5.99} 100%|█████████▉| 599/600 [1:48:41<00:10, 10.72s/it]warning: The size of tensor a (0) must match the size of tensor b (256) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([256, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.8021, device='cuda:0', grad_fn=) tensor(0.6956, device='cuda:0', grad_fn=) tensor(0.7914, device='cuda:0', grad_fn=) [2024-06-19 00:03:52,338] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 2680.13 | bwd_microstep: 1661.76 | bwd_inner_microstep: 1656.83 | bwd_allreduce_microstep: 4.77 | step_microstep: 0.08 please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. please install petrel_client Replace train sampler!! petrel_client is not installed. Using PIL to load images. 
warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) warning: The size of tensor a (0) must match the size of tensor b (1792) at non-singleton dimension 0, input_embeds[selected].shape=torch.Size([0, 6144]), vit_embeds.shape=torch.Size([1792, 6144]) tensor(0.3445, device='cuda:0', grad_fn=) tensor(0.6989, device='cuda:0', grad_fn=) tensor(0.3799, device='cuda:0', grad_fn=) [2024-06-19 00:04:00,025] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | optimizer_step: 1.86 [2024-06-19 00:04:00,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd_microstep: 3558.02 | bwd_microstep: 1893.48 | bwd_inner_microstep: 1888.09 | bwd_allreduce_microstep: 5.29 | step_microstep: 61.05 [2024-06-19 00:04:00,026] [INFO] [logging.py:96:log_dist] [Rank 0] time (ms) | fwd: 6238.10 | bwd: 3555.24 | bwd_inner: 3544.96 | bwd_allreduce: 10.06 | step: 61.13 100%|██████████| 600/600 [1:48:53<00:00, 11.14s/it] {'loss': 0.5857, 'learning_rate': 0.0, 'epoch': 6.0} 100%|██████████| 600/600 [1:48:53<00:00, 11.14s/it][INFO|trainer.py:1962] 2024-06-19 00:04:00,039 >> Training completed. 
Do not forget to share your model on huggingface.co/models =)
{'train_runtime': 6533.1892, 'train_samples_per_second': 0.367, 'train_steps_per_second': 0.092, 'train_loss': 0.6276818083102504, 'epoch': 6.0}
100%|██████████| 600/600 [1:48:53<00:00, 11.14s/it]
100%|██████████| 600/600 [1:48:53<00:00, 10.89s/it]
[INFO|trainer.py:2936] 2024-06-19 00:04:28,089 >> Saving model checkpoint to ckpts/baseline3_combined_loss_6_epochs/
[INFO|configuration_utils.py:473] 2024-06-19 00:04:28,093 >> Configuration saved in ckpts/baseline3_combined_loss_6_epochs/config.json
[INFO|configuration_utils.py:594] 2024-06-19 00:04:28,094 >> Configuration saved in ckpts/baseline3_combined_loss_6_epochs/generation_config.json
[INFO|modeling_utils.py:2501] 2024-06-19 00:05:05,855 >> The model is bigger than the maximum size per checkpoint (5GB) and is going to be split in 11 checkpoint shards. You can find where each parameters has been saved in the index located at ckpts/baseline3_combined_loss_6_epochs/model.safetensors.index.json.
[INFO|tokenization_utils_base.py:2433] 2024-06-19 00:05:05,860 >> tokenizer config file saved in ckpts/baseline3_combined_loss_6_epochs/tokenizer_config.json
[INFO|tokenization_utils_base.py:2442] 2024-06-19 00:05:05,860 >> Special tokens file saved in ckpts/baseline3_combined_loss_6_epochs/special_tokens_map.json
[INFO|tokenization_utils_base.py:2493] 2024-06-19 00:05:05,861 >> added tokens file saved in ckpts/baseline3_combined_loss_6_epochs/added_tokens.json
***** train metrics *****
  epoch                    = 6.0
  train_loss               = 0.6277
  train_runtime            = 1:48:53.18
  train_samples            = 400
  train_samples_per_second = 0.367
  train_steps_per_second   = 0.092