[2024-04-21 18:29:25,300][accelerate.utils.other][WARNING] - Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[2024-04-21 18:29:25,302][Main][INFO] - Distributed environment: NO Num processes: 1 Process index: 0 Local process index: 0 Device: cuda Mixed precision type: bf16
[2024-04-21 18:29:25,303][Main][INFO] - Working directory is /home/jovyan/nanoT5/logs/2024-04-21/18-29-25-
[2024-04-21 18:29:30,477][Main][INFO] - You are using T5 legacy LR Schedule, it's independent from the optim.base_lr
[2024-04-21 18:35:57,583][Main][INFO] - [train] Step 100 out of 120000 | Loss --> 58.172 | Grad_l2 --> 49.005 | Weights_l2 --> 13617.159 | Lr --> 0.010 | Seconds_per_step --> 3.816 |
[2024-04-21 18:39:28,138][Main][INFO] - [train] Step 200 out of 120000 | Loss --> 10.890 | Grad_l2 --> 6.917 | Weights_l2 --> 13632.467 | Lr --> 0.010 | Seconds_per_step --> 2.106 |
[2024-04-21 18:42:56,403][Main][INFO] - [train] Step 300 out of 120000 | Loss --> 8.151 | Grad_l2 --> 2.627 | Weights_l2 --> 13671.216 | Lr --> 0.010 | Seconds_per_step --> 2.083 |
[2024-04-21 18:46:26,238][Main][INFO] - [train] Step 400 out of 120000 | Loss --> 7.613 | Grad_l2 --> 1.885 | Weights_l2 --> 13711.714 | Lr --> 0.010 | Seconds_per_step --> 2.098 |
[2024-04-21 18:49:58,095][Main][INFO] - [train] Step 500 out of 120000 | Loss --> 7.384 | Grad_l2 --> 1.677 | Weights_l2 --> 13753.370 | Lr --> 0.010 | Seconds_per_step --> 2.119 |
[2024-04-21 18:53:29,393][Main][INFO] - [train] Step 600 out of 120000 | Loss --> 7.217 | Grad_l2 --> 1.544 | Weights_l2 --> 13798.030 | Lr --> 0.010 | Seconds_per_step --> 2.113 |
[2024-04-21 18:56:58,541][Main][INFO] - [train] Step 700 out of 120000 | Loss --> 7.067 | Grad_l2 --> 1.386 | Weights_l2 --> 13844.995 | Lr --> 0.010 | Seconds_per_step --> 2.091 |
[2024-04-21 19:00:27,795][Main][INFO] - [train] Step 800 out of 120000 | Loss --> 6.916 | Grad_l2 --> 1.335 | Weights_l2 --> 13892.562 | Lr --> 0.010 | Seconds_per_step --> 2.093 |
[2024-04-21 19:03:58,386][Main][INFO] - [train] Step 900 out of 120000 | Loss --> 6.794 | Grad_l2 --> 1.252 | Weights_l2 --> 13941.396 | Lr --> 0.010 | Seconds_per_step --> 2.106 |
[2024-04-21 19:07:32,095][Main][INFO] - [train] Step 1000 out of 120000 | Loss --> 6.693 | Grad_l2 --> 1.227 | Weights_l2 --> 13989.608 | Lr --> 0.010 | Seconds_per_step --> 2.137 |
[2024-04-21 19:10:58,794][Main][INFO] - [train] Step 1100 out of 120000 | Loss --> 6.586 | Grad_l2 --> 1.195 | Weights_l2 --> 14038.925 | Lr --> 0.010 | Seconds_per_step --> 2.067 |
[2024-04-21 19:14:29,664][Main][INFO] - [train] Step 1200 out of 120000 | Loss --> 6.505 | Grad_l2 --> 1.169 | Weights_l2 --> 14088.198 | Lr --> 0.010 | Seconds_per_step --> 2.109 |
[2024-04-21 19:18:01,467][Main][INFO] - [train] Step 1300 out of 120000 | Loss --> 6.422 | Grad_l2 --> 1.048 | Weights_l2 --> 14140.556 | Lr --> 0.010 | Seconds_per_step --> 2.118 |
[2024-04-21 19:21:32,405][Main][INFO] - [train] Step 1400 out of 120000 | Loss --> 6.351 | Grad_l2 --> 1.049 | Weights_l2 --> 14193.214 | Lr --> 0.010 | Seconds_per_step --> 2.109 |
[2024-04-21 19:25:06,470][Main][INFO] - [train] Step 1500 out of 120000 | Loss --> 6.272 | Grad_l2 --> 1.082 | Weights_l2 --> 14246.153 | Lr --> 0.010 | Seconds_per_step --> 2.141 |
[2024-04-21 19:28:33,397][Main][INFO] - [train] Step 1600 out of 120000 | Loss --> 6.211 | Grad_l2 --> 0.992 | Weights_l2 --> 14300.557 | Lr --> 0.010 | Seconds_per_step --> 2.069 |
[2024-04-21 19:32:07,270][Main][INFO] - [train] Step 1700 out of 120000 | Loss --> 6.144 | Grad_l2 --> 0.914 | Weights_l2 --> 14357.084 | Lr --> 0.010 | Seconds_per_step --> 2.139 |
[2024-04-21 19:35:36,840][Main][INFO] - [train] Step 1800 out of 120000 | Loss --> 6.095 | Grad_l2 --> 0.936 | Weights_l2 --> 14414.776 | Lr --> 0.010 | Seconds_per_step --> 2.096 |
[2024-04-21 19:39:07,985][Main][INFO] - [train] Step 1900 out of 120000 | Loss --> 6.038 | Grad_l2 --> 0.921 | Weights_l2 --> 14474.008 | Lr --> 0.010 | Seconds_per_step --> 2.111 |
[2024-04-21 19:42:39,397][Main][INFO] - [train] Step 2000 out of 120000 | Loss --> 5.986 | Grad_l2 --> 0.836 | Weights_l2 --> 14535.530 | Lr --> 0.010 | Seconds_per_step --> 2.114 |
[2024-04-21 19:46:13,389][Main][INFO] - [train] Step 2100 out of 120000 | Loss --> 5.936 | Grad_l2 --> 0.759 | Weights_l2 --> 14597.614 | Lr --> 0.010 | Seconds_per_step --> 2.140 |
[2024-04-21 19:49:47,272][Main][INFO] - [train] Step 2200 out of 120000 | Loss --> 5.891 | Grad_l2 --> 0.821 | Weights_l2 --> 14660.926 | Lr --> 0.010 | Seconds_per_step --> 2.139 |
[2024-04-21 19:53:16,139][Main][INFO] - [train] Step 2300 out of 120000 | Loss --> 5.859 | Grad_l2 --> 0.701 | Weights_l2 --> 14725.462 | Lr --> 0.010 | Seconds_per_step --> 2.089 |
[2024-04-21 19:56:46,367][Main][INFO] - [train] Step 2400 out of 120000 | Loss --> 5.807 | Grad_l2 --> 0.759 | Weights_l2 --> 14790.885 | Lr --> 0.010 | Seconds_per_step --> 2.102 |
[2024-04-21 20:00:17,867][Main][INFO] - [train] Step 2500 out of 120000 | Loss --> 5.777 | Grad_l2 --> 0.677 | Weights_l2 --> 14857.033 | Lr --> 0.010 | Seconds_per_step --> 2.115 |
[2024-04-21 20:03:51,239][Main][INFO] - [train] Step 2600 out of 120000 | Loss --> 5.733 | Grad_l2 --> 0.582 | Weights_l2 --> 14923.900 | Lr --> 0.010 | Seconds_per_step --> 2.134 |
[2024-04-21 20:07:22,394][Main][INFO] - [train] Step 2700 out of 120000 | Loss --> 5.695 | Grad_l2 --> 0.645 | Weights_l2 --> 14990.602 | Lr --> 0.010 | Seconds_per_step --> 2.112 |
[2024-04-21 20:10:52,365][Main][INFO] - [train] Step 2800 out of 120000 | Loss --> 5.647 | Grad_l2 --> 0.601 | Weights_l2 --> 15058.831 | Lr --> 0.010 | Seconds_per_step --> 2.100 |
[2024-04-21 20:14:22,782][Main][INFO] - [train] Step 2900 out of 120000 | Loss --> 5.628 | Grad_l2 --> 0.554 | Weights_l2 --> 15126.296 | Lr --> 0.010 | Seconds_per_step --> 2.104 |
[2024-04-21 20:17:52,438][Main][INFO] - [train] Step 3000 out of 120000 | Loss --> 5.602 | Grad_l2 --> 0.567 | Weights_l2 --> 15194.211 | Lr --> 0.010 | Seconds_per_step --> 2.097 |
[2024-04-21 20:21:22,363][Main][INFO] - [train] Step 3100 out of 120000 | Loss --> 5.560 | Grad_l2 --> 0.540 | Weights_l2 --> 15262.732 | Lr --> 0.010 | Seconds_per_step --> 2.099 |
[2024-04-21 20:24:51,894][Main][INFO] - [train] Step 3200 out of 120000 | Loss --> 5.535 | Grad_l2 --> 0.504 | Weights_l2 --> 15331.512 | Lr --> 0.010 | Seconds_per_step --> 2.095 |
[2024-04-21 20:28:22,967][Main][INFO] - [train] Step 3300 out of 120000 | Loss --> 5.516 | Grad_l2 --> 0.516 | Weights_l2 --> 15400.750 | Lr --> 0.010 | Seconds_per_step --> 2.111 |
[2024-04-21 20:31:53,799][Main][INFO] - [train] Step 3400 out of 120000 | Loss --> 5.487 | Grad_l2 --> 0.487 | Weights_l2 --> 15470.540 | Lr --> 0.010 | Seconds_per_step --> 2.108 |
[2024-04-21 20:35:27,338][Main][INFO] - [train] Step 3500 out of 120000 | Loss --> 5.409 | Grad_l2 --> 0.516 | Weights_l2 --> 15542.298 | Lr --> 0.010 | Seconds_per_step --> 2.135 |
[2024-04-21 20:38:59,893][Main][INFO] - [train] Step 3600 out of 120000 | Loss --> 5.199 | Grad_l2 --> 0.542 | Weights_l2 --> 15616.287 | Lr --> 0.010 | Seconds_per_step --> 2.126 |
[2024-04-21 20:42:33,239][Main][INFO] - [train] Step 3700 out of 120000 | Loss --> 5.085 | Grad_l2 --> 0.520 | Weights_l2 --> 15690.815 | Lr --> 0.010 | Seconds_per_step --> 2.133 |
[2024-04-21 20:46:00,796][Main][INFO] - [train] Step 3800 out of 120000 | Loss --> 5.026 | Grad_l2 --> 0.509 | Weights_l2 --> 15765.936 | Lr --> 0.010 | Seconds_per_step --> 2.076 |
[2024-04-21 20:49:31,192][Main][INFO] - [train] Step 3900 out of 120000 | Loss --> 4.953 | Grad_l2 --> 0.500 | Weights_l2 --> 15841.784 | Lr --> 0.010 | Seconds_per_step --> 2.104 |
[2024-04-21 20:53:01,242][Main][INFO] - [train] Step 4000 out of 120000 | Loss --> 4.887 | Grad_l2 --> 0.598 | Weights_l2 --> 15917.035 | Lr --> 0.010 | Seconds_per_step --> 2.100 |
[2024-04-21 20:56:31,597][Main][INFO] - [train] Step 4100 out of 120000 | Loss --> 4.794 | Grad_l2 --> 0.494 | Weights_l2 --> 15992.761 | Lr --> 0.010 | Seconds_per_step --> 2.104 |
[2024-04-21 21:00:00,098][Main][INFO] - [train] Step 4200 out of 120000 | Loss --> 4.724 | Grad_l2 --> 0.503 | Weights_l2 --> 16069.230 | Lr --> 0.010 | Seconds_per_step --> 2.085 |
[2024-04-21 21:03:32,841][Main][INFO] - [train] Step 4300 out of 120000 | Loss --> 4.644 | Grad_l2 --> 0.481 | Weights_l2 --> 16145.430 | Lr --> 0.010 | Seconds_per_step --> 2.127 |
[2024-04-21 21:07:01,068][Main][INFO] - [train] Step 4400 out of 120000 | Loss --> 4.583 | Grad_l2 --> 0.478 | Weights_l2 --> 16222.759 | Lr --> 0.010 | Seconds_per_step --> 2.082 |
[2024-04-21 21:10:30,383][Main][INFO] - [train] Step 4500 out of 120000 | Loss --> 4.513 | Grad_l2 --> 0.498 | Weights_l2 --> 16300.640 | Lr --> 0.010 | Seconds_per_step --> 2.093 |
[2024-04-21 21:14:05,268][Main][INFO] - [train] Step 4600 out of 120000 | Loss --> 4.468 | Grad_l2 --> 0.479 | Weights_l2 --> 16379.053 | Lr --> 0.010 | Seconds_per_step --> 2.149 |
[2024-04-21 21:17:34,270][Main][INFO] - [train] Step 4700 out of 120000 | Loss --> 4.377 | Grad_l2 --> 0.475 | Weights_l2 --> 16456.336 | Lr --> 0.010 | Seconds_per_step --> 2.090 |
[2024-04-21 21:21:06,881][Main][INFO] - [train] Step 4800 out of 120000 | Loss --> 4.331 | Grad_l2 --> 0.479 | Weights_l2 --> 16535.137 | Lr --> 0.010 | Seconds_per_step --> 2.126 |
[2024-04-21 21:24:39,693][Main][INFO] - [train] Step 4900 out of 120000 | Loss --> 4.268 | Grad_l2 --> 0.471 | Weights_l2 --> 16614.308 | Lr --> 0.010 | Seconds_per_step --> 2.128 |
[2024-04-21 21:28:09,850][Main][INFO] - [train] Step 5000 out of 120000 | Loss --> 4.240 | Grad_l2 --> 0.473 | Weights_l2 --> 16693.278 | Lr --> 0.010 | Seconds_per_step --> 2.102 |
[2024-04-21 21:28:10,081][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers.
[2024-04-21 21:32:32,110][Main][INFO] - [eval] Step 5000 out of 120000 | Loss --> 4.091 | Accuracy --> 0.400 | Time --> 262.257 |
[2024-04-21 21:36:02,744][Main][INFO] - [train] Step 5100 out of 120000 | Loss --> 4.193 | Grad_l2 --> 0.481 | Weights_l2 --> 16772.665 | Lr --> 0.010 | Seconds_per_step --> 2.106 |
[2024-04-21 21:39:32,438][Main][INFO] - [train] Step 5200 out of 120000 | Loss --> 4.151 | Grad_l2 --> 0.469 | Weights_l2 --> 16852.078 | Lr --> 0.010 | Seconds_per_step --> 2.097 |
[2024-04-21 21:43:04,994][Main][INFO] - [train] Step 5300 out of 120000 | Loss --> 4.103 | Grad_l2 --> 0.479 | Weights_l2 --> 16931.388 | Lr --> 0.010 | Seconds_per_step --> 2.126 |
[2024-04-21 21:46:36,692][Main][INFO] - [train] Step 5400 out of 120000 | Loss --> 4.065 | Grad_l2 --> 0.467 | Weights_l2 --> 17010.430 | Lr --> 0.010 | Seconds_per_step --> 2.117 |
[2024-04-21 21:50:09,504][Main][INFO] - [train] Step 5500 out of 120000 | Loss --> 4.031 | Grad_l2 --> 0.479 | Weights_l2 --> 17090.102 | Lr --> 0.010 | Seconds_per_step --> 2.128 |
[2024-04-21 21:53:38,683][Main][INFO] - [train] Step 5600 out of 120000 | Loss --> 4.004 | Grad_l2 --> 0.506 | Weights_l2 --> 17169.930 | Lr --> 0.010 | Seconds_per_step --> 2.092 |
[2024-04-21 21:57:09,839][Main][INFO] - [train] Step 5700 out of 120000 | Loss --> 3.950 | Grad_l2 --> 0.477 | Weights_l2 --> 17249.692 | Lr --> 0.010 | Seconds_per_step --> 2.112 |
[2024-04-21 22:00:40,753][Main][INFO] - [train] Step 5800 out of 120000 | Loss --> 3.927 | Grad_l2 --> 0.479 | Weights_l2 --> 17329.125 | Lr --> 0.010 | Seconds_per_step --> 2.109 |
[2024-04-21 22:04:10,677][Main][INFO] - [train] Step 5900 out of 120000 | Loss --> 3.884 | Grad_l2 --> 0.476 | Weights_l2 --> 17409.836 | Lr --> 0.010 | Seconds_per_step --> 2.099 |
[2024-04-21 22:07:42,041][Main][INFO] - [train] Step 6000 out of 120000 | Loss --> 3.864 | Grad_l2 --> 0.481 | Weights_l2 --> 17489.466 | Lr --> 0.010 | Seconds_per_step --> 2.114 |
[2024-04-21 22:11:12,067][Main][INFO] - [train] Step 6100 out of 120000 | Loss --> 3.828 | Grad_l2 --> 0.486 | Weights_l2 --> 17569.191 | Lr --> 0.010 | Seconds_per_step --> 2.100 |
[2024-04-21 22:14:42,894][Main][INFO] - [train] Step 6200 out of 120000 | Loss --> 3.812 | Grad_l2 --> 0.486 | Weights_l2 --> 17648.987 | Lr --> 0.010 | Seconds_per_step --> 2.108 |
[2024-04-21 22:18:10,097][Main][INFO] - [train] Step 6300 out of 120000 | Loss --> 3.782 | Grad_l2 --> 0.478 | Weights_l2 --> 17729.282 | Lr --> 0.010 | Seconds_per_step --> 2.072 |
[2024-04-21 22:21:42,367][Main][INFO] - [train] Step 6400 out of 120000 | Loss --> 3.785 | Grad_l2 --> 0.482 | Weights_l2 --> 17808.175 | Lr --> 0.010 | Seconds_per_step --> 2.123 |
[2024-04-21 22:25:12,502][Main][INFO] - [train] Step 6500 out of 120000 | Loss --> 3.746 | Grad_l2 --> 0.487 | Weights_l2 --> 17888.758 | Lr --> 0.010 | Seconds_per_step --> 2.101 |
[2024-04-21 22:28:42,904][Main][INFO] - [train] Step 6600 out of 120000 | Loss --> 3.728 | Grad_l2 --> 0.490 | Weights_l2 --> 17969.095 | Lr --> 0.010 | Seconds_per_step --> 2.104 |
[2024-04-21 22:32:13,198][Main][INFO] - [train] Step 6700 out of 120000 | Loss --> 3.712 | Grad_l2 --> 0.482 | Weights_l2 --> 18049.494 | Lr --> 0.010 | Seconds_per_step --> 2.103 |
[2024-04-21 22:35:42,157][Main][INFO] - [train] Step 6800 out of 120000 | Loss --> 3.714 | Grad_l2 --> 0.490 | Weights_l2 --> 18129.605 | Lr --> 0.010 | Seconds_per_step --> 2.090 |
[2024-04-21 22:39:18,667][Main][INFO] - [train] Step 6900 out of 120000 | Loss --> 3.682 | Grad_l2 --> 0.483 | Weights_l2 --> 18209.921 | Lr --> 0.010 | Seconds_per_step --> 2.165 |
[2024-04-21 22:42:47,542][Main][INFO] - [train] Step 7000 out of 120000 | Loss --> 3.655 | Grad_l2 --> 0.495 | Weights_l2 --> 18289.972 | Lr --> 0.010 | Seconds_per_step --> 2.089 |
[2024-04-21 22:46:19,138][Main][INFO] - [train] Step 7100 out of 120000 | Loss --> 3.639 | Grad_l2 --> 0.486 | Weights_l2 --> 18369.887 | Lr --> 0.010 | Seconds_per_step --> 2.116 |
[2024-04-21 22:49:48,184][Main][INFO] - [train] Step 7200 out of 120000 | Loss --> 3.625 | Grad_l2 --> 0.487 | Weights_l2 --> 18450.742 | Lr --> 0.010 | Seconds_per_step --> 2.090 |
[2024-04-21 22:53:20,014][Main][INFO] - [train] Step 7300 out of 120000 | Loss --> 3.604 | Grad_l2 --> 0.492 | Weights_l2 --> 18531.195 | Lr --> 0.010 | Seconds_per_step --> 2.118 |
[2024-04-21 22:56:51,700][Main][INFO] - [train] Step 7400 out of 120000 | Loss --> 3.580 | Grad_l2 --> 0.484 | Weights_l2 --> 18612.054 | Lr --> 0.010 | Seconds_per_step --> 2.117 |
[2024-04-21 23:00:21,493][Main][INFO] - [train] Step 7500 out of 120000 | Loss --> 3.571 | Grad_l2 --> 0.488 | Weights_l2 --> 18694.195 | Lr --> 0.010 | Seconds_per_step --> 2.098 |
[2024-04-21 23:03:50,291][Main][INFO] - [train] Step 7600 out of 120000 | Loss --> 3.560 | Grad_l2 --> 0.480 | Weights_l2 --> 18775.685 | Lr --> 0.010 | Seconds_per_step --> 2.088 |
[2024-04-21 23:07:21,139][Main][INFO] - [train] Step 7700 out of 120000 | Loss --> 3.536 | Grad_l2 --> 0.522 | Weights_l2 --> 18857.017 | Lr --> 0.010 | Seconds_per_step --> 2.108 |
[2024-04-21 23:10:50,537][Main][INFO] - [train] Step 7800 out of 120000 | Loss --> 3.519 | Grad_l2 --> 0.484 | Weights_l2 --> 18940.327 | Lr --> 0.010 | Seconds_per_step --> 2.094 |
[2024-04-21 23:14:23,039][Main][INFO] - [train] Step 7900 out of 120000 | Loss --> 3.494 | Grad_l2 --> 0.486 | Weights_l2 --> 19023.019 | Lr --> 0.010 | Seconds_per_step --> 2.125 |
[2024-04-21 23:17:51,965][Main][INFO] - [train] Step 8000 out of 120000 | Loss --> 3.488 | Grad_l2 --> 0.478 | Weights_l2 --> 19105.458 | Lr --> 0.010 | Seconds_per_step --> 2.089 |
[2024-04-21 23:21:22,793][Main][INFO] - [train] Step 8100 out of 120000 | Loss --> 3.451 | Grad_l2 --> 0.484 | Weights_l2 --> 19189.520 | Lr --> 0.010 | Seconds_per_step --> 2.108 |
[2024-04-21 23:24:55,740][Main][INFO] - [train] Step 8200 out of 120000 | Loss --> 3.456 | Grad_l2 --> 0.484 | Weights_l2 --> 19274.025 | Lr --> 0.010 | Seconds_per_step --> 2.129 |
[2024-04-21 23:28:28,195][Main][INFO] - [train] Step 8300 out of 120000 | Loss --> 3.456 | Grad_l2 --> 0.485 | Weights_l2 --> 19358.604 | Lr --> 0.010 | Seconds_per_step --> 2.125 |
[2024-04-21 23:31:56,238][Main][INFO] - [train] Step 8400 out of 120000 | Loss --> 3.440 | Grad_l2 --> 0.479 | Weights_l2 --> 19442.760 | Lr --> 0.010 | Seconds_per_step --> 2.080 |
[2024-04-21 23:35:28,202][Main][INFO] - [train] Step 8500 out of 120000 | Loss --> 3.421 | Grad_l2 --> 0.526 | Weights_l2 --> 19527.324 | Lr --> 0.010 | Seconds_per_step --> 2.120 |
[2024-04-21 23:38:58,976][Main][INFO] - [train] Step 8600 out of 120000 | Loss --> 3.398 | Grad_l2 --> 0.472 | Weights_l2 --> 19611.838 | Lr --> 0.010 | Seconds_per_step --> 2.108 |
[2024-04-21 23:42:28,940][Main][INFO] - [train] Step 8700 out of 120000 | Loss --> 3.380 | Grad_l2 --> 0.467 | Weights_l2 --> 19695.957 | Lr --> 0.010 | Seconds_per_step --> 2.100 |
[2024-04-21 23:45:59,794][Main][INFO] - [train] Step 8800 out of 120000 | Loss --> 3.377 | Grad_l2 --> 0.478 | Weights_l2 --> 19781.031 | Lr --> 0.010 | Seconds_per_step --> 2.109 |
[2024-04-21 23:49:31,577][Main][INFO] - [train] Step 8900 out of 120000 | Loss --> 3.364 | Grad_l2 --> 0.469 | Weights_l2 --> 19864.466 | Lr --> 0.010 | Seconds_per_step --> 2.118 |
[2024-04-21 23:53:01,057][Main][INFO] - [train] Step 9000 out of 120000 | Loss --> 3.351 | Grad_l2 --> 0.476 | Weights_l2 --> 19948.655 | Lr --> 0.010 | Seconds_per_step --> 2.095 |
[2024-04-21 23:56:32,585][Main][INFO] - [train] Step 9100 out of 120000 | Loss --> 3.343 | Grad_l2 --> 0.467 | Weights_l2 --> 20032.884 | Lr --> 0.010 | Seconds_per_step --> 2.115 |
[2024-04-22 00:00:04,544][Main][INFO] - [train] Step 9200 out of 120000 | Loss --> 3.325 | Grad_l2 --> 0.459 | Weights_l2 --> 20116.344 | Lr --> 0.010 | Seconds_per_step --> 2.120 |
[2024-04-22 00:03:33,267][Main][INFO] - [train] Step 9300 out of 120000 | Loss --> 3.331 | Grad_l2 --> 0.460 | Weights_l2 --> 20201.144 | Lr --> 0.010 | Seconds_per_step --> 2.087 |
[2024-04-22 00:07:00,369][Main][INFO] - [train] Step 9400 out of 120000 | Loss --> 3.312 | Grad_l2 --> 0.452 | Weights_l2 --> 20285.782 | Lr --> 0.010 | Seconds_per_step --> 2.071 |
[2024-04-22 00:10:35,881][Main][INFO] - [train] Step 9500 out of 120000 | Loss --> 3.297 | Grad_l2 --> 0.456 | Weights_l2 --> 20371.088 | Lr --> 0.010 | Seconds_per_step --> 2.155 |
[2024-04-22 00:14:03,995][Main][INFO] - [train] Step 9600 out of 120000 | Loss --> 3.290 | Grad_l2 --> 0.450 | Weights_l2 --> 20457.058 | Lr --> 0.010 | Seconds_per_step --> 2.081 |
[2024-04-22 00:17:33,737][Main][INFO] - [train] Step 9700 out of 120000 | Loss --> 3.265 | Grad_l2 --> 0.447 | Weights_l2 --> 20542.372 | Lr --> 0.010 | Seconds_per_step --> 2.097 |
[2024-04-22 00:21:06,054][Main][INFO] - [train] Step 9800 out of 120000 | Loss --> 3.279 | Grad_l2 --> 0.511 | Weights_l2 --> 20627.682 | Lr --> 0.010 | Seconds_per_step --> 2.123 |
[2024-04-22 00:24:38,144][Main][INFO] - [train] Step 9900 out of 120000 | Loss --> 3.246 | Grad_l2 --> 0.445 | Weights_l2 --> 20712.195 | Lr --> 0.010 | Seconds_per_step --> 2.121 |
[2024-04-22 00:28:08,566][Main][INFO] - [train] Step 10000 out of 120000 | Loss --> 3.250 | Grad_l2 --> 0.446 | Weights_l2 --> 20797.876 | Lr --> 0.010 | Seconds_per_step --> 2.104 |
[2024-04-22 00:28:08,821][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers.
[2024-04-22 00:32:30,638][Main][INFO] - [eval] Step 10000 out of 120000 | Loss --> 3.109 | Accuracy --> 0.513 | Time --> 262.070 |
[2024-04-22 00:32:30,642][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-10000
[2024-04-22 00:32:30,645][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
[2024-04-22 00:32:34,798][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-10000/model.safetensors
[2024-04-22 00:32:34,849][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-10000/optimizer.bin
[2024-04-22 00:32:34,850][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-10000/scheduler.bin
[2024-04-22 00:32:34,850][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-10000/sampler.bin
[2024-04-22 00:32:34,850][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-10000/sampler_1.bin
[2024-04-22 00:32:34,852][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-10000/random_states_0.pkl
[2024-04-22 00:36:03,039][Main][INFO] - [train] Step 10100 out of 120000 | Loss --> 3.240 | Grad_l2 --> 0.441 | Weights_l2 --> 20883.278 | Lr --> 0.010 | Seconds_per_step --> 2.124 |
[2024-04-22 00:39:35,641][Main][INFO] - [train] Step 10200 out of 120000 | Loss --> 3.227 | Grad_l2 --> 0.442 | Weights_l2 --> 20968.779 | Lr --> 0.010 | Seconds_per_step --> 2.126 |
[2024-04-22 00:43:06,339][Main][INFO] - [train] Step 10300 out of 120000 | Loss --> 3.216 | Grad_l2 --> 0.435 | Weights_l2 --> 21053.736 | Lr --> 0.010 | Seconds_per_step --> 2.107 |
[2024-04-22 00:46:38,968][Main][INFO] - [train] Step 10400 out of 120000 | Loss --> 3.216 | Grad_l2 --> 0.437 | Weights_l2 --> 21136.938 | Lr --> 0.010 | Seconds_per_step --> 2.126 |
[2024-04-22 00:50:06,495][Main][INFO] - [train] Step 10500 out of 120000 | Loss --> 3.200 | Grad_l2 --> 0.432 | Weights_l2 --> 21220.309 | Lr --> 0.010 | Seconds_per_step --> 2.075 |
[2024-04-22 00:53:36,114][Main][INFO] - [train] Step 10600 out of 120000 | Loss --> 3.178 | Grad_l2 --> 0.436 | Weights_l2 --> 21303.020 | Lr --> 0.010 | Seconds_per_step --> 2.096 |
[2024-04-22 00:57:06,695][Main][INFO] - [train] Step 10700 out of 120000 | Loss --> 3.178 | Grad_l2 --> 0.430 | Weights_l2 --> 21385.454 | Lr --> 0.010 | Seconds_per_step --> 2.106 |
[2024-04-22 01:00:34,868][Main][INFO] - [train] Step 10800 out of 120000 | Loss --> 3.179 | Grad_l2 --> 0.427 | Weights_l2 --> 21466.268 | Lr --> 0.010 | Seconds_per_step --> 2.082 |
[2024-04-22 01:04:07,195][Main][INFO] - [train] Step 10900 out of 120000 | Loss --> 3.171 | Grad_l2 --> 0.423 | Weights_l2 --> 21547.448 | Lr --> 0.010 | Seconds_per_step --> 2.123 |
[2024-04-22 01:07:39,111][Main][INFO] - [train] Step 11000 out of 120000 | Loss --> 3.146 | Grad_l2 --> 0.421 | Weights_l2 --> 21628.583 | Lr --> 0.010 | Seconds_per_step --> 2.119 |
[2024-04-22 01:11:09,187][Main][INFO] - [train] Step 11100 out of 120000 | Loss --> 3.141 | Grad_l2 --> 0.424 | Weights_l2 --> 21707.559 | Lr --> 0.009 | Seconds_per_step --> 2.101 |
[2024-04-22 01:14:38,693][Main][INFO] - [train] Step 11200 out of 120000 | Loss --> 3.137 | Grad_l2 --> 0.420 | Weights_l2 --> 21787.691 | Lr --> 0.009 | Seconds_per_step --> 2.095 |
[2024-04-22 01:18:09,138][Main][INFO] - [train] Step 11300 out of 120000 | Loss --> 3.116 | Grad_l2 --> 0.420 | Weights_l2 --> 21866.624 | Lr --> 0.009 | Seconds_per_step --> 2.104 |
[2024-04-22 01:21:42,091][Main][INFO] - [train] Step 11400 out of 120000 | Loss --> 3.098 | Grad_l2 --> 0.416 | Weights_l2 --> 21944.633 | Lr --> 0.009 | Seconds_per_step --> 2.130 |
[2024-04-22 01:25:11,396][Main][INFO] - [train] Step 11500 out of 120000 | Loss --> 3.102 | Grad_l2 --> 0.417 | Weights_l2 --> 22023.121 | Lr --> 0.009 | Seconds_per_step --> 2.093 |
[2024-04-22 01:28:43,484][Main][INFO] - [train] Step 11600 out of 120000 | Loss --> 3.084 | Grad_l2 --> 0.414 | Weights_l2 --> 22100.680 | Lr --> 0.009 | Seconds_per_step --> 2.121 |
[2024-04-22 01:32:15,182][Main][INFO] - [train] Step 11700 out of 120000 | Loss --> 3.099 | Grad_l2 --> 0.411 | Weights_l2 --> 22177.667 | Lr --> 0.009 | Seconds_per_step --> 2.117 |
[2024-04-22 01:35:48,095][Main][INFO] - [train] Step 11800 out of 120000 | Loss --> 3.062 | Grad_l2 --> 0.415 | Weights_l2 --> 22254.802 | Lr --> 0.009 | Seconds_per_step --> 2.129 |
[2024-04-22 01:39:17,038][Main][INFO] - [train] Step 11900 out of 120000 | Loss --> 3.063 | Grad_l2 --> 0.419 | Weights_l2 --> 22331.593 | Lr --> 0.009 | Seconds_per_step --> 2.089 |
[2024-04-22 01:42:49,790][Main][INFO] - [train] Step 12000 out of 120000 | Loss --> 3.058 | Grad_l2 --> 0.415 | Weights_l2 --> 22408.580 | Lr --> 0.009 | Seconds_per_step --> 2.128 |
[2024-04-22 01:46:22,676][Main][INFO] - [train] Step 12100 out of 120000 | Loss --> 3.065 | Grad_l2 --> 0.412 | Weights_l2 --> 22484.201 | Lr --> 0.009 | Seconds_per_step --> 2.129 |
[2024-04-22 01:49:51,638][Main][INFO] - [train] Step 12200 out of 120000 | Loss --> 3.045 | Grad_l2 --> 0.420 | Weights_l2 --> 22559.974 | Lr --> 0.009 | Seconds_per_step --> 2.090 |
[2024-04-22 01:53:21,495][Main][INFO] - [train] Step 12300 out of 120000 | Loss --> 3.045 | Grad_l2 --> 0.411 | Weights_l2 --> 22635.396 | Lr --> 0.009 | Seconds_per_step --> 2.099 |
[2024-04-22 01:56:52,266][Main][INFO] - [train] Step 12400 out of 120000 | Loss --> 3.016 | Grad_l2 --> 0.409 | Weights_l2 --> 22710.927 | Lr --> 0.009 | Seconds_per_step --> 2.108 |
[2024-04-22 02:00:21,037][Main][INFO] - [train] Step 12500 out of 120000 | Loss --> 3.017 | Grad_l2 --> 0.408 | Weights_l2 --> 22786.070 | Lr --> 0.009 | Seconds_per_step --> 2.088 |
[2024-04-22 02:03:53,990][Main][INFO] - [train] Step 12600 out of 120000 | Loss --> 3.017 | Grad_l2 --> 0.404 | Weights_l2 --> 22860.382 | Lr --> 0.009 | Seconds_per_step --> 2.130 |
[2024-04-22 02:07:28,600][Main][INFO] - [train] Step 12700 out of 120000 | Loss --> 3.012 | Grad_l2 --> 0.408 | Weights_l2 --> 22935.581 | Lr --> 0.009 | Seconds_per_step --> 2.146 |
[2024-04-22 02:10:58,167][Main][INFO] - [train] Step 12800 out of 120000 | Loss --> 2.988 | Grad_l2 --> 0.403 | Weights_l2 --> 23010.177 | Lr --> 0.009 | Seconds_per_step --> 2.096 |
[2024-04-22 02:14:29,594][Main][INFO] - [train] Step 12900 out of 120000 | Loss --> 2.993 | Grad_l2 --> 0.441 | Weights_l2 --> 23084.927 | Lr --> 0.009 | Seconds_per_step --> 2.114 |
[2024-04-22 02:18:01,467][Main][INFO] - [train] Step 13000 out of 120000 | Loss --> 2.989 | Grad_l2 --> 0.403 | Weights_l2 --> 23158.990 | Lr --> 0.009 | Seconds_per_step --> 2.119 |
[2024-04-22 02:21:29,490][Main][INFO] - [train] Step 13100 out of 120000 | Loss --> 2.957 | Grad_l2 --> 0.406 | Weights_l2 --> 23232.323 | Lr --> 0.009 | Seconds_per_step --> 2.080 |
[2024-04-22 02:24:59,570][Main][INFO] - [train] Step 13200 out of 120000 | Loss --> 2.962 | Grad_l2 --> 0.402 | Weights_l2 --> 23306.196 | Lr --> 0.009 | Seconds_per_step --> 2.101 |
[2024-04-22 02:28:28,444][Main][INFO] - [train] Step 13300 out of 120000 | Loss --> 2.962 | Grad_l2 --> 0.404 | Weights_l2 --> 23379.247 | Lr --> 0.009 | Seconds_per_step --> 2.089 |
[2024-04-22 02:31:59,106][Main][INFO] - [train] Step 13400 out of 120000 | Loss --> 2.932 | Grad_l2 --> 0.399 | Weights_l2 --> 23451.941 | Lr --> 0.009 | Seconds_per_step --> 2.107 |
[2024-04-22 02:35:29,894][Main][INFO] - [train] Step 13500 out of 120000 | Loss --> 2.956 | Grad_l2 --> 0.403 | Weights_l2 --> 23524.908 | Lr --> 0.009 | Seconds_per_step --> 2.108 |
[2024-04-22 02:39:00,351][Main][INFO] - [train] Step 13600 out of 120000 | Loss --> 2.941 | Grad_l2 --> 0.399 | Weights_l2 --> 23597.919 | Lr --> 0.009 | Seconds_per_step --> 2.105 |
[2024-04-22 02:42:33,567][Main][INFO] - [train] Step 13700 out of 120000 | Loss --> 2.947 | Grad_l2 --> 0.398 | Weights_l2 --> 23669.785 | Lr --> 0.009 | Seconds_per_step --> 2.132 |
[2024-04-22 02:46:04,598][Main][INFO] - [train] Step 13800 out of 120000 | Loss --> 2.940 | Grad_l2 --> 0.400 | Weights_l2 --> 23741.630 | Lr --> 0.009 | Seconds_per_step --> 2.110 |
[2024-04-22 02:49:35,242][Main][INFO] - [train] Step 13900 out of 120000 | Loss --> 2.951 | Grad_l2 --> 0.400 | Weights_l2 --> 23813.636 | Lr --> 0.008 | Seconds_per_step --> 2.106 |
[2024-04-22 02:53:06,639][Main][INFO] - [train] Step 14000 out of 120000 | Loss --> 2.937 | Grad_l2 --> 0.400 | Weights_l2 --> 23884.347 | Lr --> 0.008 | Seconds_per_step --> 2.114 |
[2024-04-22 02:56:36,797][Main][INFO] - [train] Step 14100 out of 120000 | Loss --> 2.924 | Grad_l2 --> 0.401 | Weights_l2 --> 23955.337 | Lr --> 0.008 | Seconds_per_step --> 2.102 |
[2024-04-22 03:00:07,870][Main][INFO] - [train] Step 14200 out of 120000 | Loss --> 2.924 | Grad_l2 --> 0.401 | Weights_l2 --> 24025.847 | Lr --> 0.008 | Seconds_per_step --> 2.111 |
[2024-04-22 03:03:37,145][Main][INFO] - [train] Step 14300 out of 120000 | Loss --> 2.928 | Grad_l2 --> 0.392 | Weights_l2 --> 24095.801 | Lr --> 0.008 | Seconds_per_step --> 2.093 |
[2024-04-22 03:07:10,838][Main][INFO] - [train] Step 14400 out of 120000 | Loss --> 2.923 | Grad_l2 --> 0.391 | Weights_l2 --> 24165.587 | Lr --> 0.008 | Seconds_per_step --> 2.137 |
[2024-04-22 03:10:40,368][Main][INFO] - [train] Step 14500 out of 120000 | Loss --> 2.915 | Grad_l2 --> 0.396 | Weights_l2 --> 24235.654 | Lr --> 0.008 | Seconds_per_step --> 2.095 |
[2024-04-22 03:14:11,206][Main][INFO] - [train] Step 14600 out of 120000 | Loss --> 2.911 | Grad_l2 --> 0.393 | Weights_l2 --> 24305.532 | Lr --> 0.008 | Seconds_per_step --> 2.108 |
[2024-04-22 03:17:41,067][Main][INFO] - [train] Step 14700 out of 120000 | Loss --> 2.917 | Grad_l2 --> 0.396 | Weights_l2 --> 24374.415 | Lr --> 0.008 | Seconds_per_step --> 2.099 |
[2024-04-22 03:21:12,868][Main][INFO] - [train] Step 14800 out of 120000 | Loss --> 2.906 | Grad_l2 --> 0.401 | Weights_l2 --> 24444.820 | Lr --> 0.008 | Seconds_per_step --> 2.118 |
[2024-04-22 03:24:43,171][Main][INFO] - [train] Step 14900 out of 120000 | Loss --> 2.893 | Grad_l2 --> 0.392 | Weights_l2 --> 24513.563 | Lr --> 0.008 | Seconds_per_step --> 2.103 |
[2024-04-22 03:28:14,440][Main][INFO] - [train] Step 15000 out of 120000 | Loss --> 2.890 | Grad_l2 --> 0.393 | Weights_l2 --> 24582.205 | Lr --> 0.008 | Seconds_per_step --> 2.113 |
[2024-04-22 03:28:14,693][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers.
[2024-04-22 03:32:35,086][Main][INFO] - [eval] Step 15000 out of 120000 | Loss --> 2.767 | Accuracy --> 0.550 | Time --> 260.644 |
[2024-04-22 03:36:05,751][Main][INFO] - [train] Step 15100 out of 120000 | Loss --> 2.893 | Grad_l2 --> 0.389 | Weights_l2 --> 24650.827 | Lr --> 0.008 | Seconds_per_step --> 2.107 |
[2024-04-22 03:39:36,072][Main][INFO] - [train] Step 15200 out of 120000 | Loss --> 2.876 | Grad_l2 --> 0.388 | Weights_l2 --> 24718.906 | Lr --> 0.008 | Seconds_per_step --> 2.103 |
[2024-04-22 03:43:05,498][Main][INFO] - [train] Step 15300 out of 120000 | Loss --> 2.879 | Grad_l2 --> 0.391 | Weights_l2 --> 24786.850 | Lr --> 0.008 | Seconds_per_step --> 2.094 |
[2024-04-22 03:46:35,244][Main][INFO] - [train] Step 15400 out of 120000 | Loss --> 2.857 | Grad_l2 --> 0.388 | Weights_l2 --> 24854.980 | Lr --> 0.008 | Seconds_per_step --> 2.097 |
[2024-04-22 03:50:05,994][Main][INFO] - [train] Step 15500 out of 120000 | Loss --> 2.853 | Grad_l2 --> 0.392 | Weights_l2 --> 24922.724 | Lr --> 0.008 | Seconds_per_step --> 2.107 |
[2024-04-22 03:53:37,767][Main][INFO] - [train] Step 15600 out of 120000 | Loss --> 2.856 | Grad_l2 --> 0.385 | Weights_l2 --> 24990.483 | Lr --> 0.008 | Seconds_per_step --> 2.118 |
[2024-04-22 03:57:07,399][Main][INFO] - [train] Step 15700 out of 120000 | Loss --> 2.844 | Grad_l2 --> 0.389 | Weights_l2 --> 25058.146 | Lr --> 0.008 | Seconds_per_step --> 2.096 |
[2024-04-22 04:00:38,348][Main][INFO] - [train] Step 15800 out of 120000 | Loss --> 2.852 | Grad_l2 --> 0.392 | Weights_l2 --> 25125.645 | Lr --> 0.008 | Seconds_per_step --> 2.109 |
[2024-04-22 04:04:08,006][Main][INFO] - [train] Step 15900 out of 120000 | Loss --> 2.838 | Grad_l2 --> 0.399 | Weights_l2 --> 25193.177 | Lr --> 0.008 | Seconds_per_step --> 2.097 |
[2024-04-22 04:07:38,737][Main][INFO] - [train] Step 16000 out of 120000 | Loss --> 2.833 | Grad_l2 --> 0.395 | Weights_l2 --> 25260.378 | Lr --> 0.008 | Seconds_per_step --> 2.107 |
[2024-04-22 04:11:08,895][Main][INFO] - [train] Step 16100 out of 120000 | Loss --> 2.847 | Grad_l2 --> 0.388 | Weights_l2 --> 25326.620 | Lr --> 0.008 | Seconds_per_step --> 2.102 |
[2024-04-22 04:14:42,698][Main][INFO] - [train] Step 16200 out of 120000 | Loss --> 2.840 | Grad_l2 --> 0.386 | Weights_l2 --> 25392.876 | Lr --> 0.008 | Seconds_per_step --> 2.138 |
[2024-04-22 04:18:11,953][Main][INFO] - [train] Step 16300 out of 120000 | Loss --> 2.834 | Grad_l2 --> 0.395 | Weights_l2 --> 25459.067 | Lr --> 0.008 | Seconds_per_step --> 2.093 |
[2024-04-22 04:21:40,295][Main][INFO] - [train] Step 16400 out of 120000 | Loss --> 2.808 | Grad_l2 --> 0.391 | Weights_l2 --> 25525.022 | Lr --> 0.008 | Seconds_per_step --> 2.083 |
[2024-04-22 04:25:14,289][Main][INFO] - [train] Step 16500 out of 120000 | Loss --> 2.818 | Grad_l2 --> 0.392 | Weights_l2 --> 25591.069 | Lr --> 0.008 | Seconds_per_step --> 2.140 |
[2024-04-22 04:28:46,274][Main][INFO] - [train] Step 16600 out of 120000 | Loss --> 2.823 | Grad_l2 --> 0.389 | Weights_l2 --> 25656.423 | Lr --> 0.008 | Seconds_per_step --> 2.120 |
[2024-04-22 04:32:15,005][Main][INFO] - [train] Step 16700 out of 120000 | Loss --> 2.818 | Grad_l2 --> 0.404 | Weights_l2 --> 25722.847 | Lr --> 0.008 | Seconds_per_step --> 2.087 |
[2024-04-22 04:35:44,240][Main][INFO] - [train] Step 16800 out of 120000 | Loss --> 2.818 | Grad_l2 --> 0.388 | Weights_l2 --> 25788.180 | Lr --> 0.008 | Seconds_per_step --> 2.092 |
[2024-04-22 04:39:17,838][Main][INFO] - [train] Step 16900 out of 120000 | Loss --> 2.805 | Grad_l2 --> 0.388 | Weights_l2 --> 25853.877 | Lr --> 0.008 | Seconds_per_step --> 2.136 |
[2024-04-22 04:42:48,038][Main][INFO] - [train] Step 17000 out of 120000 | Loss --> 2.806 | Grad_l2 --> 0.389 | Weights_l2 --> 25917.965 | Lr --> 0.008 | Seconds_per_step --> 2.102 |
[2024-04-22 04:46:18,307][Main][INFO] - [train] Step 17100 out of 120000 | Loss --> 2.785 | Grad_l2 --> 0.390 | Weights_l2 --> 25982.852 | Lr --> 0.008 | Seconds_per_step --> 2.103 |
[2024-04-22 04:49:50,094][Main][INFO] - [train] Step 17200 out of 120000 | Loss --> 2.811 | Grad_l2 --> 0.384 | Weights_l2 --> 26047.802 | Lr --> 0.008 | Seconds_per_step --> 2.118 |
[2024-04-22 04:53:20,838][Main][INFO] - [train] Step 17300 out of 120000 | Loss --> 2.793 | Grad_l2 --> 0.384 | Weights_l2 --> 26111.725 | Lr --> 0.008 | Seconds_per_step --> 2.107 |
[2024-04-22 04:56:51,674][Main][INFO] - [train] Step 17400 out of 120000 | Loss --> 2.796 | Grad_l2 --> 0.387 | Weights_l2 --> 26175.935 | Lr --> 0.008 | Seconds_per_step --> 2.108 |
[2024-04-22 05:00:21,439][Main][INFO] - [train] Step 17500 out of 120000 | Loss --> 2.794 | Grad_l2 --> 0.387 | Weights_l2 --> 26239.888 | Lr --> 0.008 | Seconds_per_step --> 2.098 |
[2024-04-22 05:03:51,253][Main][INFO] - [train] Step 17600 out of 120000 | Loss --> 2.773 | Grad_l2 --> 0.386 | Weights_l2 --> 26303.014 | Lr --> 0.008 | Seconds_per_step --> 2.098 |
[2024-04-22 05:07:19,672][Main][INFO] - [train] Step 17700 out of 120000 | Loss --> 2.789 | Grad_l2 --> 0.389 | Weights_l2 --> 26365.850 | Lr --> 0.008 | Seconds_per_step --> 2.084 |
[2024-04-22 05:10:52,850][Main][INFO] - [train] Step 17800 out of 120000 | Loss --> 2.774 | Grad_l2 --> 0.386 | Weights_l2 --> 26429.361 | Lr --> 0.007 | Seconds_per_step --> 2.132 |
[2024-04-22 05:14:24,593][Main][INFO] - [train] Step 17900 out of 120000 | Loss --> 2.767 | Grad_l2 --> 0.385 | Weights_l2 --> 26492.218 | Lr --> 0.007 | Seconds_per_step --> 2.117 |
[2024-04-22 05:17:53,744][Main][INFO] - [train] Step 18000 out of 120000 | Loss --> 2.773 | Grad_l2 --> 0.388 | Weights_l2 --> 26555.036 | Lr --> 0.007 | Seconds_per_step --> 2.091 |
[2024-04-22 05:21:23,243][Main][INFO] - [train] Step 18100 out of 120000 | Loss --> 2.787 | Grad_l2 --> 0.388 | Weights_l2 --> 26618.141 | Lr --> 0.007 | Seconds_per_step --> 2.095
| [2024-04-22 05:24:54,538][Main][INFO] - [train] Step 18200 out of 120000 | Loss --> 2.745 | Grad_l2 --> 0.382 | Weights_l2 --> 26680.408 | Lr --> 0.007 | Seconds_per_step --> 2.113 | [2024-04-22 05:28:25,666][Main][INFO] - [train] Step 18300 out of 120000 | Loss --> 2.772 | Grad_l2 --> 0.386 | Weights_l2 --> 26743.285 | Lr --> 0.007 | Seconds_per_step --> 2.111 | [2024-04-22 05:31:56,057][Main][INFO] - [train] Step 18400 out of 120000 | Loss --> 2.753 | Grad_l2 --> 0.385 | Weights_l2 --> 26804.989 | Lr --> 0.007 | Seconds_per_step --> 2.104 | [2024-04-22 05:35:25,518][Main][INFO] - [train] Step 18500 out of 120000 | Loss --> 2.767 | Grad_l2 --> 0.385 | Weights_l2 --> 26867.318 | Lr --> 0.007 | Seconds_per_step --> 2.095 | [2024-04-22 05:38:59,701][Main][INFO] - [train] Step 18600 out of 120000 | Loss --> 2.757 | Grad_l2 --> 0.382 | Weights_l2 --> 26929.208 | Lr --> 0.007 | Seconds_per_step --> 2.142 | [2024-04-22 05:42:30,426][Main][INFO] - [train] Step 18700 out of 120000 | Loss --> 2.750 | Grad_l2 --> 0.383 | Weights_l2 --> 26991.253 | Lr --> 0.007 | Seconds_per_step --> 2.107 | [2024-04-22 05:46:02,877][Main][INFO] - [train] Step 18800 out of 120000 | Loss --> 2.744 | Grad_l2 --> 0.390 | Weights_l2 --> 27052.361 | Lr --> 0.007 | Seconds_per_step --> 2.125 | [2024-04-22 05:49:30,867][Main][INFO] - [train] Step 18900 out of 120000 | Loss --> 2.744 | Grad_l2 --> 0.382 | Weights_l2 --> 27113.531 | Lr --> 0.007 | Seconds_per_step --> 2.080 | [2024-04-22 05:53:02,403][Main][INFO] - [train] Step 19000 out of 120000 | Loss --> 2.744 | Grad_l2 --> 0.381 | Weights_l2 --> 27173.795 | Lr --> 0.007 | Seconds_per_step --> 2.115 | [2024-04-22 05:56:32,743][Main][INFO] - [train] Step 19100 out of 120000 | Loss --> 2.738 | Grad_l2 --> 0.382 | Weights_l2 --> 27234.426 | Lr --> 0.007 | Seconds_per_step --> 2.103 | [2024-04-22 06:00:04,667][Main][INFO] - [train] Step 19200 out of 120000 | Loss --> 2.727 | Grad_l2 --> 0.385 | Weights_l2 --> 27294.991 | Lr --> 0.007 | 
Seconds_per_step --> 2.119 | [2024-04-22 06:03:34,294][Main][INFO] - [train] Step 19300 out of 120000 | Loss --> 2.714 | Grad_l2 --> 0.378 | Weights_l2 --> 27355.353 | Lr --> 0.007 | Seconds_per_step --> 2.096 | [2024-04-22 06:07:04,367][Main][INFO] - [train] Step 19400 out of 120000 | Loss --> 2.713 | Grad_l2 --> 0.380 | Weights_l2 --> 27416.339 | Lr --> 0.007 | Seconds_per_step --> 2.101 | [2024-04-22 06:10:34,695][Main][INFO] - [train] Step 19500 out of 120000 | Loss --> 2.706 | Grad_l2 --> 0.386 | Weights_l2 --> 27476.890 | Lr --> 0.007 | Seconds_per_step --> 2.103 | [2024-04-22 06:14:06,795][Main][INFO] - [train] Step 19600 out of 120000 | Loss --> 2.710 | Grad_l2 --> 0.386 | Weights_l2 --> 27537.000 | Lr --> 0.007 | Seconds_per_step --> 2.121 | [2024-04-22 06:17:39,180][Main][INFO] - [train] Step 19700 out of 120000 | Loss --> 2.716 | Grad_l2 --> 0.385 | Weights_l2 --> 27596.859 | Lr --> 0.007 | Seconds_per_step --> 2.124 | [2024-04-22 06:21:09,449][Main][INFO] - [train] Step 19800 out of 120000 | Loss --> 2.695 | Grad_l2 --> 0.382 | Weights_l2 --> 27657.161 | Lr --> 0.007 | Seconds_per_step --> 2.103 | [2024-04-22 06:24:39,494][Main][INFO] - [train] Step 19900 out of 120000 | Loss --> 2.714 | Grad_l2 --> 0.379 | Weights_l2 --> 27717.378 | Lr --> 0.007 | Seconds_per_step --> 2.100 | [2024-04-22 06:28:07,998][Main][INFO] - [train] Step 20000 out of 120000 | Loss --> 2.709 | Grad_l2 --> 0.384 | Weights_l2 --> 27777.501 | Lr --> 0.007 | Seconds_per_step --> 2.085 | [2024-04-22 06:28:08,256][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. 
[2024-04-22 06:32:31,114][Main][INFO] - [eval] Step 20000 out of 120000 | Loss --> 2.578 | Accuracy --> 0.572 | Time --> 263.113 | [2024-04-22 06:32:31,117][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-20000 [2024-04-22 06:32:31,121][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-22 06:32:35,352][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-20000/model.safetensors [2024-04-22 06:32:35,411][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-20000/optimizer.bin [2024-04-22 06:32:35,412][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-20000/scheduler.bin [2024-04-22 06:32:35,413][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-20000/sampler.bin [2024-04-22 06:32:35,413][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-20000/sampler_1.bin [2024-04-22 06:32:35,414][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-20000/random_states_0.pkl [2024-04-22 06:36:05,636][Main][INFO] - [train] Step 20100 out of 120000 | Loss --> 2.707 | Grad_l2 --> 0.381 | Weights_l2 --> 27836.672 | Lr --> 0.007 | Seconds_per_step --> 2.145 | [2024-04-22 06:39:36,387][Main][INFO] - [train] Step 20200 out of 120000 | Loss --> 2.687 | Grad_l2 --> 0.380 | Weights_l2 --> 27896.261 | Lr --> 0.007 | Seconds_per_step --> 2.107 | [2024-04-22 06:43:08,939][Main][INFO] - [train] Step 20300 out of 120000 | Loss --> 2.684 | Grad_l2 --> 0.377 | Weights_l2 --> 27955.344 | Lr --> 0.007 | Seconds_per_step --> 2.126 | [2024-04-22 06:46:39,487][Main][INFO] - [train] Step 20400 out of 120000 | Loss --> 2.705 | Grad_l2 --> 0.378 | Weights_l2 --> 28014.550 | Lr --> 0.007 | Seconds_per_step --> 2.105 | [2024-04-22 
06:50:08,724][Main][INFO] - [train] Step 20500 out of 120000 | Loss --> 2.683 | Grad_l2 --> 0.386 | Weights_l2 --> 28074.559 | Lr --> 0.007 | Seconds_per_step --> 2.092 | [2024-04-22 06:53:37,785][Main][INFO] - [train] Step 20600 out of 120000 | Loss --> 2.680 | Grad_l2 --> 0.387 | Weights_l2 --> 28133.467 | Lr --> 0.007 | Seconds_per_step --> 2.091 | [2024-04-22 06:57:08,285][Main][INFO] - [train] Step 20700 out of 120000 | Loss --> 2.679 | Grad_l2 --> 0.376 | Weights_l2 --> 28192.431 | Lr --> 0.007 | Seconds_per_step --> 2.105 | [2024-04-22 07:00:43,396][Main][INFO] - [train] Step 20800 out of 120000 | Loss --> 2.688 | Grad_l2 --> 0.376 | Weights_l2 --> 28250.773 | Lr --> 0.007 | Seconds_per_step --> 2.151 | [2024-04-22 07:04:14,839][Main][INFO] - [train] Step 20900 out of 120000 | Loss --> 2.691 | Grad_l2 --> 0.380 | Weights_l2 --> 28309.417 | Lr --> 0.007 | Seconds_per_step --> 2.114 | [2024-04-22 07:07:46,895][Main][INFO] - [train] Step 21000 out of 120000 | Loss --> 2.682 | Grad_l2 --> 0.380 | Weights_l2 --> 28367.345 | Lr --> 0.007 | Seconds_per_step --> 2.121 | [2024-04-22 07:11:18,138][Main][INFO] - [train] Step 21100 out of 120000 | Loss --> 2.682 | Grad_l2 --> 0.380 | Weights_l2 --> 28426.246 | Lr --> 0.007 | Seconds_per_step --> 2.112 | [2024-04-22 07:14:48,194][Main][INFO] - [train] Step 21200 out of 120000 | Loss --> 2.669 | Grad_l2 --> 0.377 | Weights_l2 --> 28484.433 | Lr --> 0.007 | Seconds_per_step --> 2.101 | [2024-04-22 07:18:18,636][Main][INFO] - [train] Step 21300 out of 120000 | Loss --> 2.679 | Grad_l2 --> 0.376 | Weights_l2 --> 28542.221 | Lr --> 0.007 | Seconds_per_step --> 2.104 | [2024-04-22 07:21:48,067][Main][INFO] - [train] Step 21400 out of 120000 | Loss --> 2.650 | Grad_l2 --> 0.378 | Weights_l2 --> 28599.724 | Lr --> 0.007 | Seconds_per_step --> 2.094 | [2024-04-22 07:25:18,357][Main][INFO] - [train] Step 21500 out of 120000 | Loss --> 2.652 | Grad_l2 --> 0.377 | Weights_l2 --> 28657.601 | Lr --> 0.007 | Seconds_per_step --> 2.103 
| [2024-04-22 07:28:50,341][Main][INFO] - [train] Step 21600 out of 120000 | Loss --> 2.649 | Grad_l2 --> 0.377 | Weights_l2 --> 28714.691 | Lr --> 0.007 | Seconds_per_step --> 2.120 | [2024-04-22 07:32:18,805][Main][INFO] - [train] Step 21700 out of 120000 | Loss --> 2.657 | Grad_l2 --> 0.379 | Weights_l2 --> 28771.827 | Lr --> 0.007 | Seconds_per_step --> 2.085 | [2024-04-22 07:35:50,340][Main][INFO] - [train] Step 21800 out of 120000 | Loss --> 2.629 | Grad_l2 --> 0.380 | Weights_l2 --> 28829.401 | Lr --> 0.007 | Seconds_per_step --> 2.115 | [2024-04-22 07:39:23,650][Main][INFO] - [train] Step 21900 out of 120000 | Loss --> 2.654 | Grad_l2 --> 0.378 | Weights_l2 --> 28887.246 | Lr --> 0.007 | Seconds_per_step --> 2.133 | [2024-04-22 07:42:54,487][Main][INFO] - [train] Step 22000 out of 120000 | Loss --> 2.650 | Grad_l2 --> 0.381 | Weights_l2 --> 28944.704 | Lr --> 0.007 | Seconds_per_step --> 2.108 | [2024-04-22 07:46:22,293][Main][INFO] - [train] Step 22100 out of 120000 | Loss --> 2.645 | Grad_l2 --> 0.380 | Weights_l2 --> 29002.272 | Lr --> 0.007 | Seconds_per_step --> 2.078 | [2024-04-22 07:49:55,245][Main][INFO] - [train] Step 22200 out of 120000 | Loss --> 2.648 | Grad_l2 --> 0.375 | Weights_l2 --> 29058.830 | Lr --> 0.007 | Seconds_per_step --> 2.130 | [2024-04-22 07:53:26,039][Main][INFO] - [train] Step 22300 out of 120000 | Loss --> 2.654 | Grad_l2 --> 0.379 | Weights_l2 --> 29115.882 | Lr --> 0.007 | Seconds_per_step --> 2.108 | [2024-04-22 07:56:57,094][Main][INFO] - [train] Step 22400 out of 120000 | Loss --> 2.666 | Grad_l2 --> 0.386 | Weights_l2 --> 29171.214 | Lr --> 0.007 | Seconds_per_step --> 2.111 | [2024-04-22 08:00:28,344][Main][INFO] - [train] Step 22500 out of 120000 | Loss --> 2.628 | Grad_l2 --> 0.381 | Weights_l2 --> 29227.809 | Lr --> 0.007 | Seconds_per_step --> 2.112 | [2024-04-22 08:03:58,544][Main][INFO] - [train] Step 22600 out of 120000 | Loss --> 2.630 | Grad_l2 --> 0.372 | Weights_l2 --> 29284.269 | Lr --> 0.007 | 
Seconds_per_step --> 2.102 | [2024-04-22 08:07:28,839][Main][INFO] - [train] Step 22700 out of 120000 | Loss --> 2.632 | Grad_l2 --> 0.382 | Weights_l2 --> 29340.669 | Lr --> 0.007 | Seconds_per_step --> 2.103 | [2024-04-22 08:10:59,814][Main][INFO] - [train] Step 22800 out of 120000 | Loss --> 2.632 | Grad_l2 --> 0.377 | Weights_l2 --> 29396.913 | Lr --> 0.007 | Seconds_per_step --> 2.110 | [2024-04-22 08:14:31,296][Main][INFO] - [train] Step 22900 out of 120000 | Loss --> 2.613 | Grad_l2 --> 0.382 | Weights_l2 --> 29452.691 | Lr --> 0.007 | Seconds_per_step --> 2.115 | [2024-04-22 08:18:01,493][Main][INFO] - [train] Step 23000 out of 120000 | Loss --> 2.623 | Grad_l2 --> 0.383 | Weights_l2 --> 29507.583 | Lr --> 0.007 | Seconds_per_step --> 2.102 | [2024-04-22 08:21:31,466][Main][INFO] - [train] Step 23100 out of 120000 | Loss --> 2.603 | Grad_l2 --> 0.384 | Weights_l2 --> 29563.019 | Lr --> 0.007 | Seconds_per_step --> 2.100 | [2024-04-22 08:25:00,893][Main][INFO] - [train] Step 23200 out of 120000 | Loss --> 2.605 | Grad_l2 --> 0.375 | Weights_l2 --> 29618.125 | Lr --> 0.007 | Seconds_per_step --> 2.094 | [2024-04-22 08:28:31,994][Main][INFO] - [train] Step 23300 out of 120000 | Loss --> 2.602 | Grad_l2 --> 0.374 | Weights_l2 --> 29673.068 | Lr --> 0.007 | Seconds_per_step --> 2.111 | [2024-04-22 08:32:03,773][Main][INFO] - [train] Step 23400 out of 120000 | Loss --> 2.612 | Grad_l2 --> 0.376 | Weights_l2 --> 29728.309 | Lr --> 0.007 | Seconds_per_step --> 2.118 | [2024-04-22 08:35:34,099][Main][INFO] - [train] Step 23500 out of 120000 | Loss --> 2.603 | Grad_l2 --> 0.375 | Weights_l2 --> 29783.745 | Lr --> 0.007 | Seconds_per_step --> 2.103 | [2024-04-22 08:39:07,137][Main][INFO] - [train] Step 23600 out of 120000 | Loss --> 2.599 | Grad_l2 --> 0.378 | Weights_l2 --> 29838.819 | Lr --> 0.007 | Seconds_per_step --> 2.130 | [2024-04-22 08:42:34,298][Main][INFO] - [train] Step 23700 out of 120000 | Loss --> 2.597 | Grad_l2 --> 0.378 | Weights_l2 --> 29893.747 | 
Lr --> 0.006 | Seconds_per_step --> 2.072 | [2024-04-22 08:46:06,577][Main][INFO] - [train] Step 23800 out of 120000 | Loss --> 2.596 | Grad_l2 --> 0.380 | Weights_l2 --> 29949.158 | Lr --> 0.006 | Seconds_per_step --> 2.123 | [2024-04-22 08:49:35,499][Main][INFO] - [train] Step 23900 out of 120000 | Loss --> 2.600 | Grad_l2 --> 0.387 | Weights_l2 --> 30004.277 | Lr --> 0.006 | Seconds_per_step --> 2.089 | [2024-04-22 08:53:04,938][Main][INFO] - [train] Step 24000 out of 120000 | Loss --> 2.597 | Grad_l2 --> 0.374 | Weights_l2 --> 30058.570 | Lr --> 0.006 | Seconds_per_step --> 2.094 | [2024-04-22 08:56:34,577][Main][INFO] - [train] Step 24100 out of 120000 | Loss --> 2.581 | Grad_l2 --> 0.380 | Weights_l2 --> 30112.601 | Lr --> 0.006 | Seconds_per_step --> 2.096 | [2024-04-22 09:00:06,395][Main][INFO] - [train] Step 24200 out of 120000 | Loss --> 2.583 | Grad_l2 --> 0.381 | Weights_l2 --> 30166.804 | Lr --> 0.006 | Seconds_per_step --> 2.118 | [2024-04-22 09:03:37,639][Main][INFO] - [train] Step 24300 out of 120000 | Loss --> 2.579 | Grad_l2 --> 0.373 | Weights_l2 --> 30220.640 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 09:07:09,683][Main][INFO] - [train] Step 24400 out of 120000 | Loss --> 2.566 | Grad_l2 --> 0.379 | Weights_l2 --> 30274.871 | Lr --> 0.006 | Seconds_per_step --> 2.120 | [2024-04-22 09:10:41,867][Main][INFO] - [train] Step 24500 out of 120000 | Loss --> 2.561 | Grad_l2 --> 0.375 | Weights_l2 --> 30328.467 | Lr --> 0.006 | Seconds_per_step --> 2.122 | [2024-04-22 09:14:14,552][Main][INFO] - [train] Step 24600 out of 120000 | Loss --> 2.579 | Grad_l2 --> 0.377 | Weights_l2 --> 30381.946 | Lr --> 0.006 | Seconds_per_step --> 2.127 | [2024-04-22 09:17:45,795][Main][INFO] - [train] Step 24700 out of 120000 | Loss --> 2.578 | Grad_l2 --> 0.375 | Weights_l2 --> 30434.910 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 09:21:15,735][Main][INFO] - [train] Step 24800 out of 120000 | Loss --> 2.571 | Grad_l2 --> 0.375 | Weights_l2 
--> 30488.252 | Lr --> 0.006 | Seconds_per_step --> 2.099 | [2024-04-22 09:24:46,942][Main][INFO] - [train] Step 24900 out of 120000 | Loss --> 2.587 | Grad_l2 --> 0.376 | Weights_l2 --> 30542.070 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 09:28:16,465][Main][INFO] - [train] Step 25000 out of 120000 | Loss --> 2.604 | Grad_l2 --> 0.378 | Weights_l2 --> 30595.078 | Lr --> 0.006 | Seconds_per_step --> 2.095 | [2024-04-22 09:28:16,749][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-22 09:32:37,791][Main][INFO] - [eval] Step 25000 out of 120000 | Loss --> 2.448 | Accuracy --> 0.586 | Time --> 261.324 | [2024-04-22 09:36:08,975][Main][INFO] - [train] Step 25100 out of 120000 | Loss --> 2.575 | Grad_l2 --> 0.371 | Weights_l2 --> 30648.497 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 09:39:40,180][Main][INFO] - [train] Step 25200 out of 120000 | Loss --> 2.575 | Grad_l2 --> 0.378 | Weights_l2 --> 30701.511 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 09:43:13,094][Main][INFO] - [train] Step 25300 out of 120000 | Loss --> 2.571 | Grad_l2 --> 0.376 | Weights_l2 --> 30753.709 | Lr --> 0.006 | Seconds_per_step --> 2.129 | [2024-04-22 09:46:44,000][Main][INFO] - [train] Step 25400 out of 120000 | Loss --> 2.563 | Grad_l2 --> 0.370 | Weights_l2 --> 30806.021 | Lr --> 0.006 | Seconds_per_step --> 2.109 | [2024-04-22 09:50:16,138][Main][INFO] - [train] Step 25500 out of 120000 | Loss --> 2.558 | Grad_l2 --> 0.374 | Weights_l2 --> 30858.258 | Lr --> 0.006 | Seconds_per_step --> 2.121 | [2024-04-22 09:53:47,442][Main][INFO] - [train] Step 25600 out of 120000 | Loss --> 2.540 | Grad_l2 --> 0.372 | Weights_l2 --> 30910.031 | Lr --> 0.006 | Seconds_per_step --> 2.113 | [2024-04-22 09:57:18,166][Main][INFO] - [train] Step 25700 out of 120000 | Loss --> 2.557 | Grad_l2 --> 0.379 | Weights_l2 --> 30962.379 | Lr --> 0.006 | Seconds_per_step --> 2.107 | 
[2024-04-22 10:00:50,493][Main][INFO] - [train] Step 25800 out of 120000 | Loss --> 2.565 | Grad_l2 --> 0.375 | Weights_l2 --> 31014.633 | Lr --> 0.006 | Seconds_per_step --> 2.123 | [2024-04-22 10:04:21,143][Main][INFO] - [train] Step 25900 out of 120000 | Loss --> 2.553 | Grad_l2 --> 0.371 | Weights_l2 --> 31066.586 | Lr --> 0.006 | Seconds_per_step --> 2.106 | [2024-04-22 10:07:52,570][Main][INFO] - [train] Step 26000 out of 120000 | Loss --> 2.552 | Grad_l2 --> 0.373 | Weights_l2 --> 31118.911 | Lr --> 0.006 | Seconds_per_step --> 2.114 | [2024-04-22 10:11:22,503][Main][INFO] - [train] Step 26100 out of 120000 | Loss --> 2.556 | Grad_l2 --> 0.374 | Weights_l2 --> 31171.049 | Lr --> 0.006 | Seconds_per_step --> 2.099 | [2024-04-22 10:14:52,500][Main][INFO] - [train] Step 26200 out of 120000 | Loss --> 2.566 | Grad_l2 --> 0.381 | Weights_l2 --> 31222.915 | Lr --> 0.006 | Seconds_per_step --> 2.100 | [2024-04-22 10:18:23,895][Main][INFO] - [train] Step 26300 out of 120000 | Loss --> 2.547 | Grad_l2 --> 0.369 | Weights_l2 --> 31275.261 | Lr --> 0.006 | Seconds_per_step --> 2.114 | [2024-04-22 10:21:55,049][Main][INFO] - [train] Step 26400 out of 120000 | Loss --> 2.542 | Grad_l2 --> 0.374 | Weights_l2 --> 31326.995 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 10:25:24,894][Main][INFO] - [train] Step 26500 out of 120000 | Loss --> 2.557 | Grad_l2 --> 0.373 | Weights_l2 --> 31379.012 | Lr --> 0.006 | Seconds_per_step --> 2.098 | [2024-04-22 10:28:55,781][Main][INFO] - [train] Step 26600 out of 120000 | Loss --> 2.542 | Grad_l2 --> 0.377 | Weights_l2 --> 31430.944 | Lr --> 0.006 | Seconds_per_step --> 2.109 | [2024-04-22 10:32:26,866][Main][INFO] - [train] Step 26700 out of 120000 | Loss --> 2.559 | Grad_l2 --> 0.378 | Weights_l2 --> 31482.802 | Lr --> 0.006 | Seconds_per_step --> 2.111 | [2024-04-22 10:35:59,796][Main][INFO] - [train] Step 26800 out of 120000 | Loss --> 2.562 | Grad_l2 --> 0.375 | Weights_l2 --> 31534.116 | Lr --> 0.006 | 
Seconds_per_step --> 2.129 | [2024-04-22 10:39:30,386][Main][INFO] - [train] Step 26900 out of 120000 | Loss --> 2.556 | Grad_l2 --> 0.375 | Weights_l2 --> 31585.780 | Lr --> 0.006 | Seconds_per_step --> 2.106 | [2024-04-22 10:43:01,112][Main][INFO] - [train] Step 27000 out of 120000 | Loss --> 2.565 | Grad_l2 --> 0.378 | Weights_l2 --> 31637.786 | Lr --> 0.006 | Seconds_per_step --> 2.107 | [2024-04-22 10:46:31,242][Main][INFO] - [train] Step 27100 out of 120000 | Loss --> 2.553 | Grad_l2 --> 0.379 | Weights_l2 --> 31689.563 | Lr --> 0.006 | Seconds_per_step --> 2.101 | [2024-04-22 10:50:04,273][Main][INFO] - [train] Step 27200 out of 120000 | Loss --> 2.542 | Grad_l2 --> 0.376 | Weights_l2 --> 31740.841 | Lr --> 0.006 | Seconds_per_step --> 2.130 | [2024-04-22 10:53:34,772][Main][INFO] - [train] Step 27300 out of 120000 | Loss --> 2.545 | Grad_l2 --> 0.376 | Weights_l2 --> 31792.020 | Lr --> 0.006 | Seconds_per_step --> 2.105 | [2024-04-22 10:57:05,569][Main][INFO] - [train] Step 27400 out of 120000 | Loss --> 2.543 | Grad_l2 --> 0.375 | Weights_l2 --> 31842.240 | Lr --> 0.006 | Seconds_per_step --> 2.108 | [2024-04-22 11:00:35,894][Main][INFO] - [train] Step 27500 out of 120000 | Loss --> 2.531 | Grad_l2 --> 0.378 | Weights_l2 --> 31893.182 | Lr --> 0.006 | Seconds_per_step --> 2.103 | [2024-04-22 11:04:07,467][Main][INFO] - [train] Step 27600 out of 120000 | Loss --> 2.553 | Grad_l2 --> 0.376 | Weights_l2 --> 31943.845 | Lr --> 0.006 | Seconds_per_step --> 2.116 | [2024-04-22 11:07:36,439][Main][INFO] - [train] Step 27700 out of 120000 | Loss --> 2.525 | Grad_l2 --> 0.373 | Weights_l2 --> 31994.119 | Lr --> 0.006 | Seconds_per_step --> 2.090 | [2024-04-22 11:11:07,442][Main][INFO] - [train] Step 27800 out of 120000 | Loss --> 2.522 | Grad_l2 --> 0.384 | Weights_l2 --> 32044.714 | Lr --> 0.006 | Seconds_per_step --> 2.110 | [2024-04-22 11:14:38,181][Main][INFO] - [train] Step 27900 out of 120000 | Loss --> 2.539 | Grad_l2 --> 0.370 | Weights_l2 --> 32094.594 | 
Lr --> 0.006 | Seconds_per_step --> 2.107 | [2024-04-22 11:18:10,604][Main][INFO] - [train] Step 28000 out of 120000 | Loss --> 2.543 | Grad_l2 --> 0.371 | Weights_l2 --> 32144.521 | Lr --> 0.006 | Seconds_per_step --> 2.124 | [2024-04-22 11:21:41,066][Main][INFO] - [train] Step 28100 out of 120000 | Loss --> 2.530 | Grad_l2 --> 0.374 | Weights_l2 --> 32194.829 | Lr --> 0.006 | Seconds_per_step --> 2.105 | [2024-04-22 11:25:12,495][Main][INFO] - [train] Step 28200 out of 120000 | Loss --> 2.530 | Grad_l2 --> 0.385 | Weights_l2 --> 32244.465 | Lr --> 0.006 | Seconds_per_step --> 2.114 | [2024-04-22 11:28:45,839][Main][INFO] - [train] Step 28300 out of 120000 | Loss --> 2.519 | Grad_l2 --> 0.372 | Weights_l2 --> 32293.677 | Lr --> 0.006 | Seconds_per_step --> 2.133 | [2024-04-22 11:32:18,240][Main][INFO] - [train] Step 28400 out of 120000 | Loss --> 2.527 | Grad_l2 --> 0.372 | Weights_l2 --> 32343.705 | Lr --> 0.006 | Seconds_per_step --> 2.124 | [2024-04-22 11:35:48,378][Main][INFO] - [train] Step 28500 out of 120000 | Loss --> 2.522 | Grad_l2 --> 0.373 | Weights_l2 --> 32393.469 | Lr --> 0.006 | Seconds_per_step --> 2.101 | [2024-04-22 11:39:17,393][Main][INFO] - [train] Step 28600 out of 120000 | Loss --> 2.518 | Grad_l2 --> 0.373 | Weights_l2 --> 32442.600 | Lr --> 0.006 | Seconds_per_step --> 2.090 | [2024-04-22 11:42:48,992][Main][INFO] - [train] Step 28700 out of 120000 | Loss --> 2.526 | Grad_l2 --> 0.374 | Weights_l2 --> 32491.656 | Lr --> 0.006 | Seconds_per_step --> 2.116 | [2024-04-22 11:46:21,281][Main][INFO] - [train] Step 28800 out of 120000 | Loss --> 2.494 | Grad_l2 --> 0.377 | Weights_l2 --> 32540.448 | Lr --> 0.006 | Seconds_per_step --> 2.123 | [2024-04-22 11:49:50,172][Main][INFO] - [train] Step 28900 out of 120000 | Loss --> 2.513 | Grad_l2 --> 0.374 | Weights_l2 --> 32589.914 | Lr --> 0.006 | Seconds_per_step --> 2.089 | [2024-04-22 11:53:20,867][Main][INFO] - [train] Step 29000 out of 120000 | Loss --> 2.500 | Grad_l2 --> 0.373 | Weights_l2 
--> 32639.082 | Lr --> 0.006 | Seconds_per_step --> 2.107 | [2024-04-22 11:56:51,539][Main][INFO] - [train] Step 29100 out of 120000 | Loss --> 2.500 | Grad_l2 --> 0.376 | Weights_l2 --> 32688.310 | Lr --> 0.006 | Seconds_per_step --> 2.107 | [2024-04-22 12:00:22,636][Main][INFO] - [train] Step 29200 out of 120000 | Loss --> 2.487 | Grad_l2 --> 0.378 | Weights_l2 --> 32737.249 | Lr --> 0.006 | Seconds_per_step --> 2.111 | [2024-04-22 12:03:52,842][Main][INFO] - [train] Step 29300 out of 120000 | Loss --> 2.474 | Grad_l2 --> 0.372 | Weights_l2 --> 32785.797 | Lr --> 0.006 | Seconds_per_step --> 2.102 | [2024-04-22 12:07:23,381][Main][INFO] - [train] Step 29400 out of 120000 | Loss --> 2.491 | Grad_l2 --> 0.376 | Weights_l2 --> 32835.351 | Lr --> 0.006 | Seconds_per_step --> 2.105 | [2024-04-22 12:10:53,039][Main][INFO] - [train] Step 29500 out of 120000 | Loss --> 2.487 | Grad_l2 --> 0.372 | Weights_l2 --> 32884.114 | Lr --> 0.006 | Seconds_per_step --> 2.097 | [2024-04-22 12:14:24,602][Main][INFO] - [train] Step 29600 out of 120000 | Loss --> 2.481 | Grad_l2 --> 0.374 | Weights_l2 --> 32932.915 | Lr --> 0.006 | Seconds_per_step --> 2.116 | [2024-04-22 12:17:55,349][Main][INFO] - [train] Step 29700 out of 120000 | Loss --> 2.489 | Grad_l2 --> 0.389 | Weights_l2 --> 32981.606 | Lr --> 0.006 | Seconds_per_step --> 2.107 | [2024-04-22 12:21:28,173][Main][INFO] - [train] Step 29800 out of 120000 | Loss --> 2.505 | Grad_l2 --> 0.375 | Weights_l2 --> 33030.229 | Lr --> 0.006 | Seconds_per_step --> 2.128 | [2024-04-22 12:24:57,305][Main][INFO] - [train] Step 29900 out of 120000 | Loss --> 2.489 | Grad_l2 --> 0.371 | Weights_l2 --> 33078.899 | Lr --> 0.006 | Seconds_per_step --> 2.091 | [2024-04-22 12:28:31,384][Main][INFO] - [train] Step 30000 out of 120000 | Loss --> 2.480 | Grad_l2 --> 0.371 | Weights_l2 --> 33126.945 | Lr --> 0.006 | Seconds_per_step --> 2.141 | [2024-04-22 12:28:31,827][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is 
dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-22 12:32:54,893][Main][INFO] - [eval] Step 30000 out of 120000 | Loss --> 2.360 | Accuracy --> 0.598 | Time --> 263.507 | [2024-04-22 12:32:54,896][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-30000 [2024-04-22 12:32:54,899][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-22 12:32:59,509][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-30000/model.safetensors [2024-04-22 12:32:59,561][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-30000/optimizer.bin [2024-04-22 12:32:59,562][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-30000/scheduler.bin [2024-04-22 12:32:59,562][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-30000/sampler.bin [2024-04-22 12:32:59,562][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-30000/sampler_1.bin [2024-04-22 12:32:59,564][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-30000/random_states_0.pkl [2024-04-22 12:36:30,907][Main][INFO] - [train] Step 30100 out of 120000 | Loss --> 2.471 | Grad_l2 --> 0.377 | Weights_l2 --> 33175.419 | Lr --> 0.006 | Seconds_per_step --> 2.160 | [2024-04-22 12:39:58,697][Main][INFO] - [train] Step 30200 out of 120000 | Loss --> 2.499 | Grad_l2 --> 0.373 | Weights_l2 --> 33223.033 | Lr --> 0.006 | Seconds_per_step --> 2.078 | [2024-04-22 12:43:30,003][Main][INFO] - [train] Step 30300 out of 120000 | Loss --> 2.494 | Grad_l2 --> 0.385 | Weights_l2 --> 33271.861 | Lr --> 0.006 | Seconds_per_step --> 2.113 | [2024-04-22 12:47:00,842][Main][INFO] - [train] Step 30400 out of 120000 | Loss --> 2.485 | Grad_l2 --> 0.373 | Weights_l2 --> 33319.329 | Lr --> 0.006 | 
Seconds_per_step --> 2.108 | [2024-04-22 12:50:32,439][Main][INFO] - [train] Step 30500 out of 120000 | Loss --> 2.491 | Grad_l2 --> 0.376 | Weights_l2 --> 33367.398 | Lr --> 0.006 | Seconds_per_step --> 2.116 | [2024-04-22 12:54:03,648][Main][INFO] - [train] Step 30600 out of 120000 | Loss --> 2.479 | Grad_l2 --> 0.371 | Weights_l2 --> 33414.737 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 12:57:34,439][Main][INFO] - [train] Step 30700 out of 120000 | Loss --> 2.482 | Grad_l2 --> 0.368 | Weights_l2 --> 33462.044 | Lr --> 0.006 | Seconds_per_step --> 2.108 | [2024-04-22 13:01:05,293][Main][INFO] - [train] Step 30800 out of 120000 | Loss --> 2.482 | Grad_l2 --> 0.375 | Weights_l2 --> 33509.708 | Lr --> 0.006 | Seconds_per_step --> 2.109 | [2024-04-22 13:04:38,410][Main][INFO] - [train] Step 30900 out of 120000 | Loss --> 2.466 | Grad_l2 --> 0.370 | Weights_l2 --> 33557.164 | Lr --> 0.006 | Seconds_per_step --> 2.131 | [2024-04-22 13:08:09,649][Main][INFO] - [train] Step 31000 out of 120000 | Loss --> 2.498 | Grad_l2 --> 0.377 | Weights_l2 --> 33605.023 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 13:11:41,752][Main][INFO] - [train] Step 31100 out of 120000 | Loss --> 2.475 | Grad_l2 --> 0.374 | Weights_l2 --> 33652.128 | Lr --> 0.006 | Seconds_per_step --> 2.121 | [2024-04-22 13:15:12,753][Main][INFO] - [train] Step 31200 out of 120000 | Loss --> 2.469 | Grad_l2 --> 0.374 | Weights_l2 --> 33698.291 | Lr --> 0.006 | Seconds_per_step --> 2.110 | [2024-04-22 13:18:43,745][Main][INFO] - [train] Step 31300 out of 120000 | Loss --> 2.478 | Grad_l2 --> 0.377 | Weights_l2 --> 33745.519 | Lr --> 0.006 | Seconds_per_step --> 2.110 | [2024-04-22 13:22:12,942][Main][INFO] - [train] Step 31400 out of 120000 | Loss --> 2.471 | Grad_l2 --> 0.383 | Weights_l2 --> 33792.050 | Lr --> 0.006 | Seconds_per_step --> 2.092 | [2024-04-22 13:25:45,367][Main][INFO] - [train] Step 31500 out of 120000 | Loss --> 2.457 | Grad_l2 --> 0.367 | Weights_l2 --> 33838.611 | 
Lr --> 0.006 | Seconds_per_step --> 2.124 | [2024-04-22 13:29:16,259][Main][INFO] - [train] Step 31600 out of 120000 | Loss --> 2.468 | Grad_l2 --> 0.369 | Weights_l2 --> 33885.226 | Lr --> 0.006 | Seconds_per_step --> 2.109 | [2024-04-22 13:32:47,050][Main][INFO] - [train] Step 31700 out of 120000 | Loss --> 2.467 | Grad_l2 --> 0.369 | Weights_l2 --> 33932.226 | Lr --> 0.006 | Seconds_per_step --> 2.108 | [2024-04-22 13:36:18,284][Main][INFO] - [train] Step 31800 out of 120000 | Loss --> 2.448 | Grad_l2 --> 0.376 | Weights_l2 --> 33979.268 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 13:39:48,894][Main][INFO] - [train] Step 31900 out of 120000 | Loss --> 2.455 | Grad_l2 --> 0.381 | Weights_l2 --> 34026.440 | Lr --> 0.006 | Seconds_per_step --> 2.106 | [2024-04-22 13:43:20,705][Main][INFO] - [train] Step 32000 out of 120000 | Loss --> 2.450 | Grad_l2 --> 0.370 | Weights_l2 --> 34073.190 | Lr --> 0.006 | Seconds_per_step --> 2.118 | [2024-04-22 13:46:52,151][Main][INFO] - [train] Step 32100 out of 120000 | Loss --> 2.454 | Grad_l2 --> 0.367 | Weights_l2 --> 34119.659 | Lr --> 0.006 | Seconds_per_step --> 2.114 | [2024-04-22 13:50:20,173][Main][INFO] - [train] Step 32200 out of 120000 | Loss --> 2.452 | Grad_l2 --> 0.379 | Weights_l2 --> 34166.078 | Lr --> 0.006 | Seconds_per_step --> 2.080 | [2024-04-22 13:53:51,402][Main][INFO] - [train] Step 32300 out of 120000 | Loss --> 2.464 | Grad_l2 --> 0.386 | Weights_l2 --> 34212.674 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 13:57:21,241][Main][INFO] - [train] Step 32400 out of 120000 | Loss --> 2.438 | Grad_l2 --> 0.386 | Weights_l2 --> 34259.497 | Lr --> 0.006 | Seconds_per_step --> 2.098 | [2024-04-22 14:00:51,768][Main][INFO] - [train] Step 32500 out of 120000 | Loss --> 2.444 | Grad_l2 --> 0.374 | Weights_l2 --> 34306.410 | Lr --> 0.006 | Seconds_per_step --> 2.105 | [2024-04-22 14:04:23,340][Main][INFO] - [train] Step 32600 out of 120000 | Loss --> 2.451 | Grad_l2 --> 0.378 | Weights_l2 
--> 34352.754 | Lr --> 0.006 | Seconds_per_step --> 2.116 | [2024-04-22 14:07:53,267][Main][INFO] - [train] Step 32700 out of 120000 | Loss --> 2.457 | Grad_l2 --> 0.368 | Weights_l2 --> 34398.937 | Lr --> 0.006 | Seconds_per_step --> 2.099 | [2024-04-22 14:11:25,499][Main][INFO] - [train] Step 32800 out of 120000 | Loss --> 2.480 | Grad_l2 --> 0.374 | Weights_l2 --> 34445.372 | Lr --> 0.006 | Seconds_per_step --> 2.122 | [2024-04-22 14:14:58,368][Main][INFO] - [train] Step 32900 out of 120000 | Loss --> 2.468 | Grad_l2 --> 0.379 | Weights_l2 --> 34491.751 | Lr --> 0.006 | Seconds_per_step --> 2.129 | [2024-04-22 14:18:29,567][Main][INFO] - [train] Step 33000 out of 120000 | Loss --> 2.461 | Grad_l2 --> 0.371 | Weights_l2 --> 34538.275 | Lr --> 0.006 | Seconds_per_step --> 2.112 | [2024-04-22 14:22:01,053][Main][INFO] - [train] Step 33100 out of 120000 | Loss --> 2.451 | Grad_l2 --> 0.373 | Weights_l2 --> 34584.579 | Lr --> 0.005 | Seconds_per_step --> 2.115 | [2024-04-22 14:25:30,665][Main][INFO] - [train] Step 33200 out of 120000 | Loss --> 2.455 | Grad_l2 --> 0.377 | Weights_l2 --> 34630.926 | Lr --> 0.005 | Seconds_per_step --> 2.096 | [2024-04-22 14:29:01,892][Main][INFO] - [train] Step 33300 out of 120000 | Loss --> 2.458 | Grad_l2 --> 0.375 | Weights_l2 --> 34676.744 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 14:32:30,768][Main][INFO] - [train] Step 33400 out of 120000 | Loss --> 2.442 | Grad_l2 --> 0.373 | Weights_l2 --> 34722.553 | Lr --> 0.005 | Seconds_per_step --> 2.089 | [2024-04-22 14:36:01,438][Main][INFO] - [train] Step 33500 out of 120000 | Loss --> 2.447 | Grad_l2 --> 0.380 | Weights_l2 --> 34768.716 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 14:39:30,946][Main][INFO] - [train] Step 33600 out of 120000 | Loss --> 2.441 | Grad_l2 --> 0.383 | Weights_l2 --> 34813.812 | Lr --> 0.005 | Seconds_per_step --> 2.095 | [2024-04-22 14:43:04,069][Main][INFO] - [train] Step 33700 out of 120000 | Loss --> 2.461 | Grad_l2 --> 0.381 
| Weights_l2 --> 34859.366 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 14:46:33,298][Main][INFO] - [train] Step 33800 out of 120000 | Loss --> 2.441 | Grad_l2 --> 0.378 | Weights_l2 --> 34904.606 | Lr --> 0.005 | Seconds_per_step --> 2.092 | [2024-04-22 14:50:03,228][Main][INFO] - [train] Step 33900 out of 120000 | Loss --> 2.450 | Grad_l2 --> 0.376 | Weights_l2 --> 34950.178 | Lr --> 0.005 | Seconds_per_step --> 2.099 | [2024-04-22 14:53:35,097][Main][INFO] - [train] Step 34000 out of 120000 | Loss --> 2.430 | Grad_l2 --> 0.375 | Weights_l2 --> 34996.054 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 14:57:08,167][Main][INFO] - [train] Step 34100 out of 120000 | Loss --> 2.432 | Grad_l2 --> 0.369 | Weights_l2 --> 35041.250 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 15:00:37,896][Main][INFO] - [train] Step 34200 out of 120000 | Loss --> 2.429 | Grad_l2 --> 0.378 | Weights_l2 --> 35086.106 | Lr --> 0.005 | Seconds_per_step --> 2.097 | [2024-04-22 15:04:09,343][Main][INFO] - [train] Step 34300 out of 120000 | Loss --> 2.440 | Grad_l2 --> 0.376 | Weights_l2 --> 35131.445 | Lr --> 0.005 | Seconds_per_step --> 2.114 | [2024-04-22 15:07:41,540][Main][INFO] - [train] Step 34400 out of 120000 | Loss --> 2.434 | Grad_l2 --> 0.370 | Weights_l2 --> 35176.012 | Lr --> 0.005 | Seconds_per_step --> 2.122 | [2024-04-22 15:11:14,439][Main][INFO] - [train] Step 34500 out of 120000 | Loss --> 2.421 | Grad_l2 --> 0.378 | Weights_l2 --> 35221.471 | Lr --> 0.005 | Seconds_per_step --> 2.129 | [2024-04-22 15:14:45,139][Main][INFO] - [train] Step 34600 out of 120000 | Loss --> 2.410 | Grad_l2 --> 0.376 | Weights_l2 --> 35266.139 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 15:18:15,266][Main][INFO] - [train] Step 34700 out of 120000 | Loss --> 2.423 | Grad_l2 --> 0.373 | Weights_l2 --> 35311.213 | Lr --> 0.005 | Seconds_per_step --> 2.101 | [2024-04-22 15:21:47,000][Main][INFO] - [train] Step 34800 out of 120000 | Loss --> 2.413 | 
Grad_l2 --> 0.368 | Weights_l2 --> 35355.693 | Lr --> 0.005 | Seconds_per_step --> 2.117 | [2024-04-22 15:25:18,698][Main][INFO] - [train] Step 34900 out of 120000 | Loss --> 2.417 | Grad_l2 --> 0.381 | Weights_l2 --> 35399.640 | Lr --> 0.005 | Seconds_per_step --> 2.117 | [2024-04-22 15:28:49,667][Main][INFO] - [train] Step 35000 out of 120000 | Loss --> 2.432 | Grad_l2 --> 0.384 | Weights_l2 --> 35444.587 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 15:28:49,912][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-22 15:33:13,893][Main][INFO] - [eval] Step 35000 out of 120000 | Loss --> 2.289 | Accuracy --> 0.606 | Time --> 264.224 | [2024-04-22 15:36:45,768][Main][INFO] - [train] Step 35100 out of 120000 | Loss --> 2.424 | Grad_l2 --> 0.379 | Weights_l2 --> 35489.772 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 15:40:18,174][Main][INFO] - [train] Step 35200 out of 120000 | Loss --> 2.421 | Grad_l2 --> 0.372 | Weights_l2 --> 35534.218 | Lr --> 0.005 | Seconds_per_step --> 2.124 | [2024-04-22 15:43:49,400][Main][INFO] - [train] Step 35300 out of 120000 | Loss --> 2.422 | Grad_l2 --> 0.373 | Weights_l2 --> 35578.831 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 15:47:23,166][Main][INFO] - [train] Step 35400 out of 120000 | Loss --> 2.432 | Grad_l2 --> 0.381 | Weights_l2 --> 35623.711 | Lr --> 0.005 | Seconds_per_step --> 2.138 | [2024-04-22 15:50:54,747][Main][INFO] - [train] Step 35500 out of 120000 | Loss --> 2.435 | Grad_l2 --> 0.373 | Weights_l2 --> 35667.886 | Lr --> 0.005 | Seconds_per_step --> 2.116 | [2024-04-22 15:54:27,469][Main][INFO] - [train] Step 35600 out of 120000 | Loss --> 2.431 | Grad_l2 --> 0.372 | Weights_l2 --> 35712.557 | Lr --> 0.005 | Seconds_per_step --> 2.127 | [2024-04-22 15:57:58,940][Main][INFO] - [train] Step 35700 out of 120000 | Loss --> 2.403 | Grad_l2 --> 0.375 | Weights_l2 --> 35756.855 | Lr --> 0.005 | 
Seconds_per_step --> 2.115 | [2024-04-22 16:01:31,488][Main][INFO] - [train] Step 35800 out of 120000 | Loss --> 2.415 | Grad_l2 --> 0.378 | Weights_l2 --> 35801.247 | Lr --> 0.005 | Seconds_per_step --> 2.125 | [2024-04-22 16:05:03,225][Main][INFO] - [train] Step 35900 out of 120000 | Loss --> 2.408 | Grad_l2 --> 0.375 | Weights_l2 --> 35845.236 | Lr --> 0.005 | Seconds_per_step --> 2.117 | [2024-04-22 16:08:36,138][Main][INFO] - [train] Step 36000 out of 120000 | Loss --> 2.401 | Grad_l2 --> 0.366 | Weights_l2 --> 35889.216 | Lr --> 0.005 | Seconds_per_step --> 2.129 | [2024-04-22 16:12:06,537][Main][INFO] - [train] Step 36100 out of 120000 | Loss --> 2.405 | Grad_l2 --> 0.372 | Weights_l2 --> 35933.740 | Lr --> 0.005 | Seconds_per_step --> 2.104 | [2024-04-22 16:15:39,745][Main][INFO] - [train] Step 36200 out of 120000 | Loss --> 2.383 | Grad_l2 --> 0.374 | Weights_l2 --> 35977.535 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-22 16:19:12,340][Main][INFO] - [train] Step 36300 out of 120000 | Loss --> 2.403 | Grad_l2 --> 0.374 | Weights_l2 --> 36021.221 | Lr --> 0.005 | Seconds_per_step --> 2.126 | [2024-04-22 16:22:42,439][Main][INFO] - [train] Step 36400 out of 120000 | Loss --> 2.380 | Grad_l2 --> 0.369 | Weights_l2 --> 36064.966 | Lr --> 0.005 | Seconds_per_step --> 2.101 | [2024-04-22 16:26:14,295][Main][INFO] - [train] Step 36500 out of 120000 | Loss --> 2.375 | Grad_l2 --> 0.374 | Weights_l2 --> 36108.327 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 16:29:45,594][Main][INFO] - [train] Step 36600 out of 120000 | Loss --> 2.387 | Grad_l2 --> 0.372 | Weights_l2 --> 36152.029 | Lr --> 0.005 | Seconds_per_step --> 2.113 | [2024-04-22 16:33:18,193][Main][INFO] - [train] Step 36700 out of 120000 | Loss --> 2.400 | Grad_l2 --> 0.377 | Weights_l2 --> 36196.063 | Lr --> 0.005 | Seconds_per_step --> 2.126 | [2024-04-22 16:36:49,072][Main][INFO] - [train] Step 36800 out of 120000 | Loss --> 2.398 | Grad_l2 --> 0.374 | Weights_l2 --> 36239.760 | 
Lr --> 0.005 | Seconds_per_step --> 2.109 | [2024-04-22 16:40:21,381][Main][INFO] - [train] Step 36900 out of 120000 | Loss --> 2.397 | Grad_l2 --> 0.375 | Weights_l2 --> 36283.350 | Lr --> 0.005 | Seconds_per_step --> 2.123 | [2024-04-22 16:43:56,270][Main][INFO] - [train] Step 37000 out of 120000 | Loss --> 2.394 | Grad_l2 --> 0.373 | Weights_l2 --> 36326.756 | Lr --> 0.005 | Seconds_per_step --> 2.149 | [2024-04-22 16:47:24,367][Main][INFO] - [train] Step 37100 out of 120000 | Loss --> 2.389 | Grad_l2 --> 0.377 | Weights_l2 --> 36370.815 | Lr --> 0.005 | Seconds_per_step --> 2.081 | [2024-04-22 16:50:56,298][Main][INFO] - [train] Step 37200 out of 120000 | Loss --> 2.396 | Grad_l2 --> 0.373 | Weights_l2 --> 36414.308 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 16:54:28,294][Main][INFO] - [train] Step 37300 out of 120000 | Loss --> 2.392 | Grad_l2 --> 0.383 | Weights_l2 --> 36457.965 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 16:57:59,705][Main][INFO] - [train] Step 37400 out of 120000 | Loss --> 2.392 | Grad_l2 --> 0.378 | Weights_l2 --> 36501.195 | Lr --> 0.005 | Seconds_per_step --> 2.114 | [2024-04-22 17:01:33,894][Main][INFO] - [train] Step 37500 out of 120000 | Loss --> 2.409 | Grad_l2 --> 0.370 | Weights_l2 --> 36544.737 | Lr --> 0.005 | Seconds_per_step --> 2.142 | [2024-04-22 17:05:03,739][Main][INFO] - [train] Step 37600 out of 120000 | Loss --> 2.408 | Grad_l2 --> 0.380 | Weights_l2 --> 36588.103 | Lr --> 0.005 | Seconds_per_step --> 2.098 | [2024-04-22 17:08:35,067][Main][INFO] - [train] Step 37700 out of 120000 | Loss --> 2.401 | Grad_l2 --> 0.384 | Weights_l2 --> 36631.423 | Lr --> 0.005 | Seconds_per_step --> 2.113 | [2024-04-22 17:12:06,683][Main][INFO] - [train] Step 37800 out of 120000 | Loss --> 2.397 | Grad_l2 --> 0.373 | Weights_l2 --> 36673.923 | Lr --> 0.005 | Seconds_per_step --> 2.116 | [2024-04-22 17:15:37,781][Main][INFO] - [train] Step 37900 out of 120000 | Loss --> 2.397 | Grad_l2 --> 0.379 | Weights_l2 
--> 36717.076 | Lr --> 0.005 | Seconds_per_step --> 2.111 | [2024-04-22 17:19:12,539][Main][INFO] - [train] Step 38000 out of 120000 | Loss --> 2.392 | Grad_l2 --> 0.377 | Weights_l2 --> 36760.210 | Lr --> 0.005 | Seconds_per_step --> 2.148 | [2024-04-22 17:22:42,336][Main][INFO] - [train] Step 38100 out of 120000 | Loss --> 2.391 | Grad_l2 --> 0.375 | Weights_l2 --> 36803.553 | Lr --> 0.005 | Seconds_per_step --> 2.098 | [2024-04-22 17:26:18,238][Main][INFO] - [train] Step 38200 out of 120000 | Loss --> 2.385 | Grad_l2 --> 0.374 | Weights_l2 --> 36846.895 | Lr --> 0.005 | Seconds_per_step --> 2.159 | [2024-04-22 17:29:46,098][Main][INFO] - [train] Step 38300 out of 120000 | Loss --> 2.368 | Grad_l2 --> 0.379 | Weights_l2 --> 36889.515 | Lr --> 0.005 | Seconds_per_step --> 2.079 | [2024-04-22 17:33:16,809][Main][INFO] - [train] Step 38400 out of 120000 | Loss --> 2.379 | Grad_l2 --> 0.370 | Weights_l2 --> 36931.443 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 17:36:50,160][Main][INFO] - [train] Step 38500 out of 120000 | Loss --> 2.360 | Grad_l2 --> 0.371 | Weights_l2 --> 36974.201 | Lr --> 0.005 | Seconds_per_step --> 2.134 | [2024-04-22 17:40:21,671][Main][INFO] - [train] Step 38600 out of 120000 | Loss --> 2.370 | Grad_l2 --> 0.378 | Weights_l2 --> 37016.546 | Lr --> 0.005 | Seconds_per_step --> 2.115 | [2024-04-22 17:43:54,152][Main][INFO] - [train] Step 38700 out of 120000 | Loss --> 2.366 | Grad_l2 --> 0.369 | Weights_l2 --> 37058.453 | Lr --> 0.005 | Seconds_per_step --> 2.125 | [2024-04-22 17:47:23,967][Main][INFO] - [train] Step 38800 out of 120000 | Loss --> 2.349 | Grad_l2 --> 0.368 | Weights_l2 --> 37100.459 | Lr --> 0.005 | Seconds_per_step --> 2.098 | [2024-04-22 17:50:56,493][Main][INFO] - [train] Step 38900 out of 120000 | Loss --> 2.359 | Grad_l2 --> 0.370 | Weights_l2 --> 37142.208 | Lr --> 0.005 | Seconds_per_step --> 2.125 | [2024-04-22 17:54:27,496][Main][INFO] - [train] Step 39000 out of 120000 | Loss --> 2.371 | Grad_l2 --> 0.385 
| Weights_l2 --> 37184.304 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 17:57:58,267][Main][INFO] - [train] Step 39100 out of 120000 | Loss --> 2.365 | Grad_l2 --> 0.374 | Weights_l2 --> 37226.385 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 18:01:31,640][Main][INFO] - [train] Step 39200 out of 120000 | Loss --> 2.378 | Grad_l2 --> 0.377 | Weights_l2 --> 37268.459 | Lr --> 0.005 | Seconds_per_step --> 2.134 | [2024-04-22 18:05:04,839][Main][INFO] - [train] Step 39300 out of 120000 | Loss --> 2.362 | Grad_l2 --> 0.374 | Weights_l2 --> 37309.979 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-22 18:08:36,638][Main][INFO] - [train] Step 39400 out of 120000 | Loss --> 2.350 | Grad_l2 --> 0.371 | Weights_l2 --> 37352.226 | Lr --> 0.005 | Seconds_per_step --> 2.118 | [2024-04-22 18:12:08,586][Main][INFO] - [train] Step 39500 out of 120000 | Loss --> 2.362 | Grad_l2 --> 0.370 | Weights_l2 --> 37394.015 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 18:15:41,571][Main][INFO] - [train] Step 39600 out of 120000 | Loss --> 2.370 | Grad_l2 --> 0.370 | Weights_l2 --> 37435.705 | Lr --> 0.005 | Seconds_per_step --> 2.130 | [2024-04-22 18:19:13,099][Main][INFO] - [train] Step 39700 out of 120000 | Loss --> 2.367 | Grad_l2 --> 0.379 | Weights_l2 --> 37477.424 | Lr --> 0.005 | Seconds_per_step --> 2.115 | [2024-04-22 18:22:44,615][Main][INFO] - [train] Step 39800 out of 120000 | Loss --> 2.383 | Grad_l2 --> 0.375 | Weights_l2 --> 37519.361 | Lr --> 0.005 | Seconds_per_step --> 2.115 | [2024-04-22 18:26:16,293][Main][INFO] - [train] Step 39900 out of 120000 | Loss --> 2.362 | Grad_l2 --> 0.377 | Weights_l2 --> 37561.080 | Lr --> 0.005 | Seconds_per_step --> 2.117 | [2024-04-22 18:29:49,871][Main][INFO] - [train] Step 40000 out of 120000 | Loss --> 2.363 | Grad_l2 --> 0.371 | Weights_l2 --> 37603.093 | Lr --> 0.005 | Seconds_per_step --> 2.136 | [2024-04-22 18:29:50,132][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-22 18:34:13,717][Main][INFO] - [eval] Step 40000 out of 120000 | Loss --> 2.235 | Accuracy --> 0.613 | Time --> 263.844 | [2024-04-22 18:34:13,720][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-40000 [2024-04-22 18:34:13,723][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-22 18:34:18,153][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-40000/model.safetensors [2024-04-22 18:34:18,210][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-40000/optimizer.bin [2024-04-22 18:34:18,212][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-40000/scheduler.bin [2024-04-22 18:34:18,212][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-40000/sampler.bin [2024-04-22 18:34:18,212][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-40000/sampler_1.bin [2024-04-22 18:34:18,214][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-40000/random_states_0.pkl [2024-04-22 18:37:50,866][Main][INFO] - [train] Step 40100 out of 120000 | Loss --> 2.361 | Grad_l2 --> 0.377 | Weights_l2 --> 37644.864 | Lr --> 0.005 | Seconds_per_step --> 2.171 | [2024-04-22 18:41:20,439][Main][INFO] - [train] Step 40200 out of 120000 | Loss --> 2.367 | Grad_l2 --> 0.370 | Weights_l2 --> 37687.179 | Lr --> 0.005 | Seconds_per_step --> 2.096 | [2024-04-22 18:44:51,193][Main][INFO] - [train] Step 40300 out of 120000 | Loss --> 2.347 | Grad_l2 --> 0.365 | Weights_l2 --> 37728.754 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 18:48:24,068][Main][INFO] - [train] Step 40400 out of 120000 | Loss --> 2.362 | Grad_l2 --> 0.370 | Weights_l2 --> 37770.643 | Lr --> 0.005 | 
Seconds_per_step --> 2.129 | [2024-04-22 18:51:58,238][Main][INFO] - [train] Step 40500 out of 120000 | Loss --> 2.348 | Grad_l2 --> 0.378 | Weights_l2 --> 37812.917 | Lr --> 0.005 | Seconds_per_step --> 2.142 | [2024-04-22 18:55:29,368][Main][INFO] - [train] Step 40600 out of 120000 | Loss --> 2.328 | Grad_l2 --> 0.375 | Weights_l2 --> 37854.139 | Lr --> 0.005 | Seconds_per_step --> 2.111 | [2024-04-22 18:58:59,389][Main][INFO] - [train] Step 40700 out of 120000 | Loss --> 2.338 | Grad_l2 --> 0.379 | Weights_l2 --> 37896.121 | Lr --> 0.005 | Seconds_per_step --> 2.100 | [2024-04-22 19:02:34,798][Main][INFO] - [train] Step 40800 out of 120000 | Loss --> 2.353 | Grad_l2 --> 0.374 | Weights_l2 --> 37937.324 | Lr --> 0.005 | Seconds_per_step --> 2.154 | [2024-04-22 19:06:04,240][Main][INFO] - [train] Step 40900 out of 120000 | Loss --> 2.347 | Grad_l2 --> 0.377 | Weights_l2 --> 37978.608 | Lr --> 0.005 | Seconds_per_step --> 2.094 | [2024-04-22 19:09:37,377][Main][INFO] - [train] Step 41000 out of 120000 | Loss --> 2.357 | Grad_l2 --> 0.378 | Weights_l2 --> 38019.674 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 19:13:09,765][Main][INFO] - [train] Step 41100 out of 120000 | Loss --> 2.362 | Grad_l2 --> 0.382 | Weights_l2 --> 38060.678 | Lr --> 0.005 | Seconds_per_step --> 2.124 | [2024-04-22 19:16:40,938][Main][INFO] - [train] Step 41200 out of 120000 | Loss --> 2.354 | Grad_l2 --> 0.376 | Weights_l2 --> 38101.318 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 19:20:14,317][Main][INFO] - [train] Step 41300 out of 120000 | Loss --> 2.369 | Grad_l2 --> 0.373 | Weights_l2 --> 38141.997 | Lr --> 0.005 | Seconds_per_step --> 2.134 | [2024-04-22 19:23:46,386][Main][INFO] - [train] Step 41400 out of 120000 | Loss --> 2.348 | Grad_l2 --> 0.384 | Weights_l2 --> 38183.216 | Lr --> 0.005 | Seconds_per_step --> 2.121 | [2024-04-22 19:27:18,099][Main][INFO] - [train] Step 41500 out of 120000 | Loss --> 2.360 | Grad_l2 --> 0.379 | Weights_l2 --> 38223.979 | 
Lr --> 0.005 | Seconds_per_step --> 2.117 | [2024-04-22 19:30:50,706][Main][INFO] - [train] Step 41600 out of 120000 | Loss --> 2.357 | Grad_l2 --> 0.371 | Weights_l2 --> 38265.116 | Lr --> 0.005 | Seconds_per_step --> 2.126 | [2024-04-22 19:34:21,942][Main][INFO] - [train] Step 41700 out of 120000 | Loss --> 2.359 | Grad_l2 --> 0.380 | Weights_l2 --> 38305.771 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 19:37:51,783][Main][INFO] - [train] Step 41800 out of 120000 | Loss --> 2.356 | Grad_l2 --> 0.370 | Weights_l2 --> 38346.686 | Lr --> 0.005 | Seconds_per_step --> 2.098 | [2024-04-22 19:41:22,248][Main][INFO] - [train] Step 41900 out of 120000 | Loss --> 2.356 | Grad_l2 --> 0.380 | Weights_l2 --> 38387.731 | Lr --> 0.005 | Seconds_per_step --> 2.105 | [2024-04-22 19:44:54,245][Main][INFO] - [train] Step 42000 out of 120000 | Loss --> 2.343 | Grad_l2 --> 0.372 | Weights_l2 --> 38428.452 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 19:48:24,672][Main][INFO] - [train] Step 42100 out of 120000 | Loss --> 2.342 | Grad_l2 --> 0.374 | Weights_l2 --> 38469.122 | Lr --> 0.005 | Seconds_per_step --> 2.104 | [2024-04-22 19:51:57,867][Main][INFO] - [train] Step 42200 out of 120000 | Loss --> 2.352 | Grad_l2 --> 0.381 | Weights_l2 --> 38509.848 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-22 19:55:28,138][Main][INFO] - [train] Step 42300 out of 120000 | Loss --> 2.334 | Grad_l2 --> 0.380 | Weights_l2 --> 38550.773 | Lr --> 0.005 | Seconds_per_step --> 2.103 | [2024-04-22 19:59:01,469][Main][INFO] - [train] Step 42400 out of 120000 | Loss --> 2.352 | Grad_l2 --> 0.377 | Weights_l2 --> 38590.744 | Lr --> 0.005 | Seconds_per_step --> 2.133 | [2024-04-22 20:02:33,593][Main][INFO] - [train] Step 42500 out of 120000 | Loss --> 2.323 | Grad_l2 --> 0.409 | Weights_l2 --> 38631.099 | Lr --> 0.005 | Seconds_per_step --> 2.121 | [2024-04-22 20:06:02,997][Main][INFO] - [train] Step 42600 out of 120000 | Loss --> 2.344 | Grad_l2 --> 0.378 | Weights_l2 
--> 38671.163 | Lr --> 0.005 | Seconds_per_step --> 2.094 | [2024-04-22 20:09:35,567][Main][INFO] - [train] Step 42700 out of 120000 | Loss --> 2.332 | Grad_l2 --> 0.377 | Weights_l2 --> 38710.848 | Lr --> 0.005 | Seconds_per_step --> 2.126 | [2024-04-22 20:13:09,296][Main][INFO] - [train] Step 42800 out of 120000 | Loss --> 2.330 | Grad_l2 --> 0.373 | Weights_l2 --> 38750.409 | Lr --> 0.005 | Seconds_per_step --> 2.137 | [2024-04-22 20:16:40,341][Main][INFO] - [train] Step 42900 out of 120000 | Loss --> 2.327 | Grad_l2 --> 0.374 | Weights_l2 --> 38790.040 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 20:20:14,039][Main][INFO] - [train] Step 43000 out of 120000 | Loss --> 2.323 | Grad_l2 --> 0.374 | Weights_l2 --> 38829.951 | Lr --> 0.005 | Seconds_per_step --> 2.137 | [2024-04-22 20:23:44,737][Main][INFO] - [train] Step 43100 out of 120000 | Loss --> 2.321 | Grad_l2 --> 0.377 | Weights_l2 --> 38869.181 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 20:27:15,967][Main][INFO] - [train] Step 43200 out of 120000 | Loss --> 2.335 | Grad_l2 --> 0.377 | Weights_l2 --> 38909.134 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 20:30:46,970][Main][INFO] - [train] Step 43300 out of 120000 | Loss --> 2.309 | Grad_l2 --> 0.378 | Weights_l2 --> 38948.909 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 20:34:21,282][Main][INFO] - [train] Step 43400 out of 120000 | Loss --> 2.308 | Grad_l2 --> 0.368 | Weights_l2 --> 38988.645 | Lr --> 0.005 | Seconds_per_step --> 2.143 | [2024-04-22 20:37:52,085][Main][INFO] - [train] Step 43500 out of 120000 | Loss --> 2.332 | Grad_l2 --> 0.378 | Weights_l2 --> 39028.565 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 20:41:24,996][Main][INFO] - [train] Step 43600 out of 120000 | Loss --> 2.329 | Grad_l2 --> 0.372 | Weights_l2 --> 39067.592 | Lr --> 0.005 | Seconds_per_step --> 2.129 | [2024-04-22 20:44:54,356][Main][INFO] - [train] Step 43700 out of 120000 | Loss --> 2.318 | Grad_l2 --> 0.385 
| Weights_l2 --> 39107.365 | Lr --> 0.005 | Seconds_per_step --> 2.094 | [2024-04-22 20:48:27,670][Main][INFO] - [train] Step 43800 out of 120000 | Loss --> 2.319 | Grad_l2 --> 0.373 | Weights_l2 --> 39146.525 | Lr --> 0.005 | Seconds_per_step --> 2.133 | [2024-04-22 20:52:00,581][Main][INFO] - [train] Step 43900 out of 120000 | Loss --> 2.323 | Grad_l2 --> 0.377 | Weights_l2 --> 39186.270 | Lr --> 0.005 | Seconds_per_step --> 2.129 | [2024-04-22 20:55:33,651][Main][INFO] - [train] Step 44000 out of 120000 | Loss --> 2.324 | Grad_l2 --> 0.372 | Weights_l2 --> 39225.517 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 20:59:05,669][Main][INFO] - [train] Step 44100 out of 120000 | Loss --> 2.314 | Grad_l2 --> 0.376 | Weights_l2 --> 39264.323 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 21:02:38,866][Main][INFO] - [train] Step 44200 out of 120000 | Loss --> 2.314 | Grad_l2 --> 0.373 | Weights_l2 --> 39303.369 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-22 21:06:10,851][Main][INFO] - [train] Step 44300 out of 120000 | Loss --> 2.317 | Grad_l2 --> 0.385 | Weights_l2 --> 39342.630 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 21:09:42,405][Main][INFO] - [train] Step 44400 out of 120000 | Loss --> 2.326 | Grad_l2 --> 0.379 | Weights_l2 --> 39381.879 | Lr --> 0.005 | Seconds_per_step --> 2.116 | [2024-04-22 21:13:14,173][Main][INFO] - [train] Step 44500 out of 120000 | Loss --> 2.326 | Grad_l2 --> 0.382 | Weights_l2 --> 39420.813 | Lr --> 0.005 | Seconds_per_step --> 2.118 | [2024-04-22 21:16:45,340][Main][INFO] - [train] Step 44600 out of 120000 | Loss --> 2.310 | Grad_l2 --> 0.388 | Weights_l2 --> 39460.062 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 21:20:17,579][Main][INFO] - [train] Step 44700 out of 120000 | Loss --> 2.310 | Grad_l2 --> 0.370 | Weights_l2 --> 39499.187 | Lr --> 0.005 | Seconds_per_step --> 2.122 | [2024-04-22 21:23:48,568][Main][INFO] - [train] Step 44800 out of 120000 | Loss --> 2.291 | 
Grad_l2 --> 0.375 | Weights_l2 --> 39537.981 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 21:27:19,895][Main][INFO] - [train] Step 44900 out of 120000 | Loss --> 2.288 | Grad_l2 --> 0.375 | Weights_l2 --> 39576.908 | Lr --> 0.005 | Seconds_per_step --> 2.113 | [2024-04-22 21:30:51,464][Main][INFO] - [train] Step 45000 out of 120000 | Loss --> 2.306 | Grad_l2 --> 0.381 | Weights_l2 --> 39616.006 | Lr --> 0.005 | Seconds_per_step --> 2.116 | [2024-04-22 21:30:51,763][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-22 21:35:13,590][Main][INFO] - [eval] Step 45000 out of 120000 | Loss --> 2.185 | Accuracy --> 0.619 | Time --> 262.124 | [2024-04-22 21:38:48,295][Main][INFO] - [train] Step 45100 out of 120000 | Loss --> 2.294 | Grad_l2 --> 0.387 | Weights_l2 --> 39655.432 | Lr --> 0.005 | Seconds_per_step --> 2.147 | [2024-04-22 21:42:19,139][Main][INFO] - [train] Step 45200 out of 120000 | Loss --> 2.286 | Grad_l2 --> 0.378 | Weights_l2 --> 39694.753 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 21:45:52,338][Main][INFO] - [train] Step 45300 out of 120000 | Loss --> 2.313 | Grad_l2 --> 0.385 | Weights_l2 --> 39733.596 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-22 21:49:23,196][Main][INFO] - [train] Step 45400 out of 120000 | Loss --> 2.308 | Grad_l2 --> 0.387 | Weights_l2 --> 39771.689 | Lr --> 0.005 | Seconds_per_step --> 2.109 | [2024-04-22 21:52:53,945][Main][INFO] - [train] Step 45500 out of 120000 | Loss --> 2.288 | Grad_l2 --> 0.377 | Weights_l2 --> 39810.226 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 21:56:25,901][Main][INFO] - [train] Step 45600 out of 120000 | Loss --> 2.315 | Grad_l2 --> 0.379 | Weights_l2 --> 39849.065 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 21:59:56,882][Main][INFO] - [train] Step 45700 out of 120000 | Loss --> 2.307 | Grad_l2 --> 0.381 | Weights_l2 --> 39888.160 | Lr --> 0.005 | 
Seconds_per_step --> 2.110 | [2024-04-22 22:03:29,583][Main][INFO] - [train] Step 45800 out of 120000 | Loss --> 2.319 | Grad_l2 --> 0.379 | Weights_l2 --> 39927.092 | Lr --> 0.005 | Seconds_per_step --> 2.127 | [2024-04-22 22:07:00,238][Main][INFO] - [train] Step 45900 out of 120000 | Loss --> 2.310 | Grad_l2 --> 0.381 | Weights_l2 --> 39965.996 | Lr --> 0.005 | Seconds_per_step --> 2.107 | [2024-04-22 22:10:33,339][Main][INFO] - [train] Step 46000 out of 120000 | Loss --> 2.304 | Grad_l2 --> 0.390 | Weights_l2 --> 40004.261 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 22:14:06,426][Main][INFO] - [train] Step 46100 out of 120000 | Loss --> 2.305 | Grad_l2 --> 0.375 | Weights_l2 --> 40042.905 | Lr --> 0.005 | Seconds_per_step --> 2.131 | [2024-04-22 22:17:38,412][Main][INFO] - [train] Step 46200 out of 120000 | Loss --> 2.285 | Grad_l2 --> 0.378 | Weights_l2 --> 40081.306 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 22:21:08,687][Main][INFO] - [train] Step 46300 out of 120000 | Loss --> 2.311 | Grad_l2 --> 0.380 | Weights_l2 --> 40120.324 | Lr --> 0.005 | Seconds_per_step --> 2.103 | [2024-04-22 22:24:40,776][Main][INFO] - [train] Step 46400 out of 120000 | Loss --> 2.319 | Grad_l2 --> 0.381 | Weights_l2 --> 40158.785 | Lr --> 0.005 | Seconds_per_step --> 2.121 | [2024-04-22 22:28:12,641][Main][INFO] - [train] Step 46500 out of 120000 | Loss --> 2.319 | Grad_l2 --> 0.378 | Weights_l2 --> 40196.584 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 22:31:44,077][Main][INFO] - [train] Step 46600 out of 120000 | Loss --> 2.311 | Grad_l2 --> 0.384 | Weights_l2 --> 40235.245 | Lr --> 0.005 | Seconds_per_step --> 2.114 | [2024-04-22 22:35:15,981][Main][INFO] - [train] Step 46700 out of 120000 | Loss --> 2.305 | Grad_l2 --> 0.382 | Weights_l2 --> 40273.920 | Lr --> 0.005 | Seconds_per_step --> 2.119 | [2024-04-22 22:38:47,179][Main][INFO] - [train] Step 46800 out of 120000 | Loss --> 2.314 | Grad_l2 --> 0.377 | Weights_l2 --> 40312.484 | 
Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 22:42:18,174][Main][INFO] - [train] Step 46900 out of 120000 | Loss --> 2.298 | Grad_l2 --> 0.388 | Weights_l2 --> 40350.858 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 22:45:49,639][Main][INFO] - [train] Step 47000 out of 120000 | Loss --> 2.287 | Grad_l2 --> 0.377 | Weights_l2 --> 40388.687 | Lr --> 0.005 | Seconds_per_step --> 2.115 | [2024-04-22 22:49:24,104][Main][INFO] - [train] Step 47100 out of 120000 | Loss --> 2.302 | Grad_l2 --> 0.382 | Weights_l2 --> 40426.972 | Lr --> 0.005 | Seconds_per_step --> 2.145 | [2024-04-22 22:52:55,339][Main][INFO] - [train] Step 47200 out of 120000 | Loss --> 2.291 | Grad_l2 --> 0.373 | Weights_l2 --> 40465.263 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 22:56:26,915][Main][INFO] - [train] Step 47300 out of 120000 | Loss --> 2.294 | Grad_l2 --> 0.379 | Weights_l2 --> 40502.736 | Lr --> 0.005 | Seconds_per_step --> 2.116 | [2024-04-22 22:59:57,897][Main][INFO] - [train] Step 47400 out of 120000 | Loss --> 2.296 | Grad_l2 --> 0.372 | Weights_l2 --> 40541.065 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 23:03:28,390][Main][INFO] - [train] Step 47500 out of 120000 | Loss --> 2.287 | Grad_l2 --> 0.373 | Weights_l2 --> 40578.659 | Lr --> 0.005 | Seconds_per_step --> 2.105 | [2024-04-22 23:06:59,767][Main][INFO] - [train] Step 47600 out of 120000 | Loss --> 2.302 | Grad_l2 --> 0.379 | Weights_l2 --> 40616.970 | Lr --> 0.005 | Seconds_per_step --> 2.114 | [2024-04-22 23:10:33,507][Main][INFO] - [train] Step 47700 out of 120000 | Loss --> 2.294 | Grad_l2 --> 0.389 | Weights_l2 --> 40654.813 | Lr --> 0.005 | Seconds_per_step --> 2.137 | [2024-04-22 23:14:04,486][Main][INFO] - [train] Step 47800 out of 120000 | Loss --> 2.300 | Grad_l2 --> 0.377 | Weights_l2 --> 40692.461 | Lr --> 0.005 | Seconds_per_step --> 2.110 | [2024-04-22 23:17:36,506][Main][INFO] - [train] Step 47900 out of 120000 | Loss --> 2.291 | Grad_l2 --> 0.386 | Weights_l2 
--> 40730.402 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 23:21:07,742][Main][INFO] - [train] Step 48000 out of 120000 | Loss --> 2.289 | Grad_l2 --> 0.383 | Weights_l2 --> 40767.664 | Lr --> 0.005 | Seconds_per_step --> 2.112 | [2024-04-22 23:24:39,964][Main][INFO] - [train] Step 48100 out of 120000 | Loss --> 2.292 | Grad_l2 --> 0.378 | Weights_l2 --> 40805.196 | Lr --> 0.005 | Seconds_per_step --> 2.122 | [2024-04-22 23:28:11,341][Main][INFO] - [train] Step 48200 out of 120000 | Loss --> 2.279 | Grad_l2 --> 0.382 | Weights_l2 --> 40842.579 | Lr --> 0.005 | Seconds_per_step --> 2.114 | [2024-04-22 23:31:42,112][Main][INFO] - [train] Step 48300 out of 120000 | Loss --> 2.291 | Grad_l2 --> 0.380 | Weights_l2 --> 40879.716 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 23:35:14,072][Main][INFO] - [train] Step 48400 out of 120000 | Loss --> 2.271 | Grad_l2 --> 0.380 | Weights_l2 --> 40916.522 | Lr --> 0.005 | Seconds_per_step --> 2.120 | [2024-04-22 23:38:44,837][Main][INFO] - [train] Step 48500 out of 120000 | Loss --> 2.253 | Grad_l2 --> 0.374 | Weights_l2 --> 40953.636 | Lr --> 0.005 | Seconds_per_step --> 2.108 | [2024-04-22 23:42:18,299][Main][INFO] - [train] Step 48600 out of 120000 | Loss --> 2.285 | Grad_l2 --> 0.373 | Weights_l2 --> 40991.129 | Lr --> 0.005 | Seconds_per_step --> 2.135 | [2024-04-22 23:45:47,468][Main][INFO] - [train] Step 48700 out of 120000 | Loss --> 2.272 | Grad_l2 --> 0.377 | Weights_l2 --> 41028.557 | Lr --> 0.005 | Seconds_per_step --> 2.092 | [2024-04-22 23:49:21,100][Main][INFO] - [train] Step 48800 out of 120000 | Loss --> 2.284 | Grad_l2 --> 0.373 | Weights_l2 --> 41065.481 | Lr --> 0.005 | Seconds_per_step --> 2.136 | [2024-04-22 23:52:53,504][Main][INFO] - [train] Step 48900 out of 120000 | Loss --> 2.283 | Grad_l2 --> 0.375 | Weights_l2 --> 41102.898 | Lr --> 0.005 | Seconds_per_step --> 2.124 | [2024-04-22 23:56:22,430][Main][INFO] - [train] Step 49000 out of 120000 | Loss --> 2.291 | Grad_l2 --> 0.379 
| Weights_l2 --> 41140.166 | Lr --> 0.005 | Seconds_per_step --> 2.089 | [2024-04-22 23:59:54,839][Main][INFO] - [train] Step 49100 out of 120000 | Loss --> 2.287 | Grad_l2 --> 0.379 | Weights_l2 --> 41177.279 | Lr --> 0.005 | Seconds_per_step --> 2.124 | [2024-04-23 00:03:28,070][Main][INFO] - [train] Step 49200 out of 120000 | Loss --> 2.277 | Grad_l2 --> 0.378 | Weights_l2 --> 41214.828 | Lr --> 0.005 | Seconds_per_step --> 2.132 | [2024-04-23 00:07:01,374][Main][INFO] - [train] Step 49300 out of 120000 | Loss --> 2.274 | Grad_l2 --> 0.375 | Weights_l2 --> 41251.962 | Lr --> 0.005 | Seconds_per_step --> 2.133 | [2024-04-23 00:10:34,608][Main][INFO] - [train] Step 49400 out of 120000 | Loss --> 2.278 | Grad_l2 --> 0.380 | Weights_l2 --> 41289.021 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 00:14:05,839][Main][INFO] - [train] Step 49500 out of 120000 | Loss --> 2.285 | Grad_l2 --> 0.378 | Weights_l2 --> 41325.727 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 00:17:37,294][Main][INFO] - [train] Step 49600 out of 120000 | Loss --> 2.286 | Grad_l2 --> 0.388 | Weights_l2 --> 41362.696 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 00:21:08,384][Main][INFO] - [train] Step 49700 out of 120000 | Loss --> 2.290 | Grad_l2 --> 0.376 | Weights_l2 --> 41400.180 | Lr --> 0.004 | Seconds_per_step --> 2.111 | [2024-04-23 00:24:39,171][Main][INFO] - [train] Step 49800 out of 120000 | Loss --> 2.291 | Grad_l2 --> 0.379 | Weights_l2 --> 41437.452 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 00:28:13,097][Main][INFO] - [train] Step 49900 out of 120000 | Loss --> 2.288 | Grad_l2 --> 0.392 | Weights_l2 --> 41474.237 | Lr --> 0.004 | Seconds_per_step --> 2.139 | [2024-04-23 00:31:46,199][Main][INFO] - [train] Step 50000 out of 120000 | Loss --> 2.294 | Grad_l2 --> 0.377 | Weights_l2 --> 41511.337 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 00:31:46,803][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 00:36:12,339][Main][INFO] - [eval] Step 50000 out of 120000 | Loss --> 2.140 | Accuracy --> 0.624 | Time --> 266.137 | [2024-04-23 00:36:12,342][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-50000 [2024-04-23 00:36:12,345][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-23 00:36:16,882][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-50000/model.safetensors [2024-04-23 00:36:16,934][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-50000/optimizer.bin [2024-04-23 00:36:16,936][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-50000/scheduler.bin [2024-04-23 00:36:16,936][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-50000/sampler.bin [2024-04-23 00:36:16,936][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-50000/sampler_1.bin [2024-04-23 00:36:16,937][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-50000/random_states_0.pkl [2024-04-23 00:39:48,567][Main][INFO] - [train] Step 50100 out of 120000 | Loss --> 2.282 | Grad_l2 --> 0.376 | Weights_l2 --> 41547.934 | Lr --> 0.004 | Seconds_per_step --> 2.162 | [2024-04-23 00:43:20,805][Main][INFO] - [train] Step 50200 out of 120000 | Loss --> 2.282 | Grad_l2 --> 0.385 | Weights_l2 --> 41584.929 | Lr --> 0.004 | Seconds_per_step --> 2.122 | [2024-04-23 00:46:50,596][Main][INFO] - [train] Step 50300 out of 120000 | Loss --> 2.304 | Grad_l2 --> 0.381 | Weights_l2 --> 41621.796 | Lr --> 0.004 | Seconds_per_step --> 2.098 | [2024-04-23 00:50:22,016][Main][INFO] - [train] Step 50400 out of 120000 | Loss --> 2.307 | Grad_l2 --> 0.382 | Weights_l2 --> 41658.790 | Lr --> 0.004 | 
Seconds_per_step --> 2.114 | [2024-04-23 00:53:56,374][Main][INFO] - [train] Step 50500 out of 120000 | Loss --> 2.298 | Grad_l2 --> 0.371 | Weights_l2 --> 41695.787 | Lr --> 0.004 | Seconds_per_step --> 2.144 | [2024-04-23 00:57:28,096][Main][INFO] - [train] Step 50600 out of 120000 | Loss --> 2.308 | Grad_l2 --> 0.377 | Weights_l2 --> 41732.568 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 01:00:59,696][Main][INFO] - [train] Step 50700 out of 120000 | Loss --> 2.295 | Grad_l2 --> 0.392 | Weights_l2 --> 41769.415 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 01:04:30,448][Main][INFO] - [train] Step 50800 out of 120000 | Loss --> 2.287 | Grad_l2 --> 0.388 | Weights_l2 --> 41806.139 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 01:08:01,797][Main][INFO] - [train] Step 50900 out of 120000 | Loss --> 2.300 | Grad_l2 --> 0.377 | Weights_l2 --> 41842.866 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 01:11:35,894][Main][INFO] - [train] Step 51000 out of 120000 | Loss --> 2.308 | Grad_l2 --> 0.382 | Weights_l2 --> 41879.856 | Lr --> 0.004 | Seconds_per_step --> 2.141 | [2024-04-23 01:15:04,399][Main][INFO] - [train] Step 51100 out of 120000 | Loss --> 2.287 | Grad_l2 --> 0.376 | Weights_l2 --> 41916.878 | Lr --> 0.004 | Seconds_per_step --> 2.085 | [2024-04-23 01:18:35,796][Main][INFO] - [train] Step 51200 out of 120000 | Loss --> 2.278 | Grad_l2 --> 0.395 | Weights_l2 --> 41953.698 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 01:22:07,168][Main][INFO] - [train] Step 51300 out of 120000 | Loss --> 2.283 | Grad_l2 --> 0.380 | Weights_l2 --> 41990.156 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 01:25:39,303][Main][INFO] - [train] Step 51400 out of 120000 | Loss --> 2.273 | Grad_l2 --> 0.381 | Weights_l2 --> 42027.225 | Lr --> 0.004 | Seconds_per_step --> 2.121 | [2024-04-23 01:29:11,281][Main][INFO] - [train] Step 51500 out of 120000 | Loss --> 2.270 | Grad_l2 --> 0.387 | Weights_l2 --> 42063.423 | 
Lr --> 0.004 | Seconds_per_step --> 2.120 | [2024-04-23 01:32:46,893][Main][INFO] - [train] Step 51600 out of 120000 | Loss --> 2.266 | Grad_l2 --> 0.373 | Weights_l2 --> 42099.641 | Lr --> 0.004 | Seconds_per_step --> 2.156 | [2024-04-23 01:36:17,866][Main][INFO] - [train] Step 51700 out of 120000 | Loss --> 2.272 | Grad_l2 --> 0.379 | Weights_l2 --> 42136.562 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 01:39:50,898][Main][INFO] - [train] Step 51800 out of 120000 | Loss --> 2.273 | Grad_l2 --> 0.380 | Weights_l2 --> 42173.021 | Lr --> 0.004 | Seconds_per_step --> 2.130 | [2024-04-23 01:43:23,795][Main][INFO] - [train] Step 51900 out of 120000 | Loss --> 2.265 | Grad_l2 --> 0.384 | Weights_l2 --> 42209.480 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 01:46:54,756][Main][INFO] - [train] Step 52000 out of 120000 | Loss --> 2.264 | Grad_l2 --> 0.389 | Weights_l2 --> 42246.645 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 01:50:26,738][Main][INFO] - [train] Step 52100 out of 120000 | Loss --> 2.265 | Grad_l2 --> 0.377 | Weights_l2 --> 42283.366 | Lr --> 0.004 | Seconds_per_step --> 2.120 | [2024-04-23 01:53:58,210][Main][INFO] - [train] Step 52200 out of 120000 | Loss --> 2.292 | Grad_l2 --> 0.380 | Weights_l2 --> 42319.941 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 01:57:30,901][Main][INFO] - [train] Step 52300 out of 120000 | Loss --> 2.278 | Grad_l2 --> 0.388 | Weights_l2 --> 42355.944 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 02:01:02,466][Main][INFO] - [train] Step 52400 out of 120000 | Loss --> 2.274 | Grad_l2 --> 0.381 | Weights_l2 --> 42391.915 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 02:04:31,881][Main][INFO] - [train] Step 52500 out of 120000 | Loss --> 2.270 | Grad_l2 --> 0.379 | Weights_l2 --> 42428.077 | Lr --> 0.004 | Seconds_per_step --> 2.094 | [2024-04-23 02:08:05,199][Main][INFO] - [train] Step 52600 out of 120000 | Loss --> 2.297 | Grad_l2 --> 0.382 | Weights_l2 
--> 42464.314 | Lr --> 0.004 | Seconds_per_step --> 2.133 | [2024-04-23 02:11:36,696][Main][INFO] - [train] Step 52700 out of 120000 | Loss --> 2.279 | Grad_l2 --> 0.386 | Weights_l2 --> 42500.623 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 02:15:08,401][Main][INFO] - [train] Step 52800 out of 120000 | Loss --> 2.267 | Grad_l2 --> 0.375 | Weights_l2 --> 42536.946 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 02:18:40,045][Main][INFO] - [train] Step 52900 out of 120000 | Loss --> 2.282 | Grad_l2 --> 0.385 | Weights_l2 --> 42572.986 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 02:22:12,398][Main][INFO] - [train] Step 53000 out of 120000 | Loss --> 2.260 | Grad_l2 --> 0.378 | Weights_l2 --> 42609.125 | Lr --> 0.004 | Seconds_per_step --> 2.124 | [2024-04-23 02:25:45,172][Main][INFO] - [train] Step 53100 out of 120000 | Loss --> 2.269 | Grad_l2 --> 0.384 | Weights_l2 --> 42645.271 | Lr --> 0.004 | Seconds_per_step --> 2.128 | [2024-04-23 02:29:17,102][Main][INFO] - [train] Step 53200 out of 120000 | Loss --> 2.272 | Grad_l2 --> 0.391 | Weights_l2 --> 42681.356 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 02:32:50,653][Main][INFO] - [train] Step 53300 out of 120000 | Loss --> 2.279 | Grad_l2 --> 0.381 | Weights_l2 --> 42716.708 | Lr --> 0.004 | Seconds_per_step --> 2.136 | [2024-04-23 02:36:25,696][Main][INFO] - [train] Step 53400 out of 120000 | Loss --> 2.283 | Grad_l2 --> 0.380 | Weights_l2 --> 42752.715 | Lr --> 0.004 | Seconds_per_step --> 2.150 | [2024-04-23 02:39:57,880][Main][INFO] - [train] Step 53500 out of 120000 | Loss --> 2.277 | Grad_l2 --> 0.383 | Weights_l2 --> 42788.656 | Lr --> 0.004 | Seconds_per_step --> 2.122 | [2024-04-23 02:43:29,798][Main][INFO] - [train] Step 53600 out of 120000 | Loss --> 2.281 | Grad_l2 --> 0.385 | Weights_l2 --> 42824.245 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 02:47:01,964][Main][INFO] - [train] Step 53700 out of 120000 | Loss --> 2.276 | Grad_l2 --> 0.376 
| Weights_l2 --> 42859.710 | Lr --> 0.004 | Seconds_per_step --> 2.122 | [2024-04-23 02:50:32,002][Main][INFO] - [train] Step 53800 out of 120000 | Loss --> 2.259 | Grad_l2 --> 0.393 | Weights_l2 --> 42895.749 | Lr --> 0.004 | Seconds_per_step --> 2.100 | [2024-04-23 02:54:05,437][Main][INFO] - [train] Step 53900 out of 120000 | Loss --> 2.268 | Grad_l2 --> 0.376 | Weights_l2 --> 42931.403 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 02:57:37,545][Main][INFO] - [train] Step 54000 out of 120000 | Loss --> 2.272 | Grad_l2 --> 0.383 | Weights_l2 --> 42966.538 | Lr --> 0.004 | Seconds_per_step --> 2.121 | [2024-04-23 03:01:07,581][Main][INFO] - [train] Step 54100 out of 120000 | Loss --> 2.252 | Grad_l2 --> 0.388 | Weights_l2 --> 43001.995 | Lr --> 0.004 | Seconds_per_step --> 2.100 | [2024-04-23 03:04:40,365][Main][INFO] - [train] Step 54200 out of 120000 | Loss --> 2.246 | Grad_l2 --> 0.378 | Weights_l2 --> 43037.185 | Lr --> 0.004 | Seconds_per_step --> 2.128 | [2024-04-23 03:08:10,175][Main][INFO] - [train] Step 54300 out of 120000 | Loss --> 2.243 | Grad_l2 --> 0.383 | Weights_l2 --> 43072.438 | Lr --> 0.004 | Seconds_per_step --> 2.098 | [2024-04-23 03:11:43,487][Main][INFO] - [train] Step 54400 out of 120000 | Loss --> 2.266 | Grad_l2 --> 0.384 | Weights_l2 --> 43107.989 | Lr --> 0.004 | Seconds_per_step --> 2.133 | [2024-04-23 03:15:15,993][Main][INFO] - [train] Step 54500 out of 120000 | Loss --> 2.260 | Grad_l2 --> 0.378 | Weights_l2 --> 43143.989 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 03:18:46,885][Main][INFO] - [train] Step 54600 out of 120000 | Loss --> 2.238 | Grad_l2 --> 0.378 | Weights_l2 --> 43179.391 | Lr --> 0.004 | Seconds_per_step --> 2.109 | [2024-04-23 03:22:17,695][Main][INFO] - [train] Step 54700 out of 120000 | Loss --> 2.248 | Grad_l2 --> 0.382 | Weights_l2 --> 43214.968 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 03:25:48,439][Main][INFO] - [train] Step 54800 out of 120000 | Loss --> 2.256 | 
Grad_l2 --> 0.384 | Weights_l2 --> 43249.891 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 03:29:22,741][Main][INFO] - [train] Step 54900 out of 120000 | Loss --> 2.259 | Grad_l2 --> 0.382 | Weights_l2 --> 43285.173 | Lr --> 0.004 | Seconds_per_step --> 2.143 | [2024-04-23 03:32:51,268][Main][INFO] - [train] Step 55000 out of 120000 | Loss --> 2.246 | Grad_l2 --> 0.399 | Weights_l2 --> 43320.386 | Lr --> 0.004 | Seconds_per_step --> 2.085 | [2024-04-23 03:32:51,506][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 03:37:13,397][Main][INFO] - [eval] Step 55000 out of 120000 | Loss --> 2.108 | Accuracy --> 0.630 | Time --> 262.126 | [2024-04-23 03:40:45,940][Main][INFO] - [train] Step 55100 out of 120000 | Loss --> 2.253 | Grad_l2 --> 0.381 | Weights_l2 --> 43355.297 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 03:44:17,664][Main][INFO] - [train] Step 55200 out of 120000 | Loss --> 2.239 | Grad_l2 --> 0.375 | Weights_l2 --> 43390.034 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 03:47:49,093][Main][INFO] - [train] Step 55300 out of 120000 | Loss --> 2.245 | Grad_l2 --> 0.382 | Weights_l2 --> 43424.562 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 03:51:21,497][Main][INFO] - [train] Step 55400 out of 120000 | Loss --> 2.231 | Grad_l2 --> 0.376 | Weights_l2 --> 43459.239 | Lr --> 0.004 | Seconds_per_step --> 2.124 | [2024-04-23 03:54:54,681][Main][INFO] - [train] Step 55500 out of 120000 | Loss --> 2.264 | Grad_l2 --> 0.396 | Weights_l2 --> 43493.834 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 03:58:27,646][Main][INFO] - [train] Step 55600 out of 120000 | Loss --> 2.245 | Grad_l2 --> 0.379 | Weights_l2 --> 43528.416 | Lr --> 0.004 | Seconds_per_step --> 2.130 | [2024-04-23 04:02:01,283][Main][INFO] - [train] Step 55700 out of 120000 | Loss --> 2.236 | Grad_l2 --> 0.379 | Weights_l2 --> 43562.040 | Lr --> 0.004 | 
Seconds_per_step --> 2.136 | [2024-04-23 04:05:32,639][Main][INFO] - [train] Step 55800 out of 120000 | Loss --> 2.238 | Grad_l2 --> 0.387 | Weights_l2 --> 43596.910 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 04:09:06,782][Main][INFO] - [train] Step 55900 out of 120000 | Loss --> 2.245 | Grad_l2 --> 0.389 | Weights_l2 --> 43632.002 | Lr --> 0.004 | Seconds_per_step --> 2.141 | [2024-04-23 04:12:37,939][Main][INFO] - [train] Step 56000 out of 120000 | Loss --> 2.258 | Grad_l2 --> 0.381 | Weights_l2 --> 43666.740 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 04:16:11,072][Main][INFO] - [train] Step 56100 out of 120000 | Loss --> 2.258 | Grad_l2 --> 0.384 | Weights_l2 --> 43701.700 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 04:19:44,140][Main][INFO] - [train] Step 56200 out of 120000 | Loss --> 2.235 | Grad_l2 --> 0.381 | Weights_l2 --> 43736.415 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 04:23:15,993][Main][INFO] - [train] Step 56300 out of 120000 | Loss --> 2.242 | Grad_l2 --> 0.381 | Weights_l2 --> 43770.819 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 04:26:47,644][Main][INFO] - [train] Step 56400 out of 120000 | Loss --> 2.254 | Grad_l2 --> 0.388 | Weights_l2 --> 43805.240 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 04:30:17,968][Main][INFO] - [train] Step 56500 out of 120000 | Loss --> 2.261 | Grad_l2 --> 0.385 | Weights_l2 --> 43839.328 | Lr --> 0.004 | Seconds_per_step --> 2.103 | [2024-04-23 04:33:48,493][Main][INFO] - [train] Step 56600 out of 120000 | Loss --> 2.265 | Grad_l2 --> 0.388 | Weights_l2 --> 43873.525 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 04:37:21,181][Main][INFO] - [train] Step 56700 out of 120000 | Loss --> 2.257 | Grad_l2 --> 0.385 | Weights_l2 --> 43907.643 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 04:40:53,898][Main][INFO] - [train] Step 56800 out of 120000 | Loss --> 2.255 | Grad_l2 --> 0.377 | Weights_l2 --> 43942.242 | 
Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 04:44:25,097][Main][INFO] - [train] Step 56900 out of 120000 | Loss --> 2.256 | Grad_l2 --> 0.381 | Weights_l2 --> 43976.616 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 04:47:56,147][Main][INFO] - [train] Step 57000 out of 120000 | Loss --> 2.234 | Grad_l2 --> 0.380 | Weights_l2 --> 44010.554 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 04:51:28,040][Main][INFO] - [train] Step 57100 out of 120000 | Loss --> 2.239 | Grad_l2 --> 0.375 | Weights_l2 --> 44044.460 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 04:55:00,947][Main][INFO] - [train] Step 57200 out of 120000 | Loss --> 2.237 | Grad_l2 --> 0.386 | Weights_l2 --> 44078.807 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 04:58:30,194][Main][INFO] - [train] Step 57300 out of 120000 | Loss --> 2.228 | Grad_l2 --> 0.379 | Weights_l2 --> 44112.480 | Lr --> 0.004 | Seconds_per_step --> 2.092 | [2024-04-23 05:02:01,938][Main][INFO] - [train] Step 57400 out of 120000 | Loss --> 2.226 | Grad_l2 --> 0.380 | Weights_l2 --> 44146.680 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 05:05:36,239][Main][INFO] - [train] Step 57500 out of 120000 | Loss --> 2.223 | Grad_l2 --> 0.376 | Weights_l2 --> 44180.146 | Lr --> 0.004 | Seconds_per_step --> 2.143 | [2024-04-23 05:09:09,293][Main][INFO] - [train] Step 57600 out of 120000 | Loss --> 2.231 | Grad_l2 --> 0.383 | Weights_l2 --> 44214.255 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 05:12:39,797][Main][INFO] - [train] Step 57700 out of 120000 | Loss --> 2.225 | Grad_l2 --> 0.385 | Weights_l2 --> 44248.378 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 05:16:14,247][Main][INFO] - [train] Step 57800 out of 120000 | Loss --> 2.243 | Grad_l2 --> 0.380 | Weights_l2 --> 44282.408 | Lr --> 0.004 | Seconds_per_step --> 2.144 | [2024-04-23 05:19:46,000][Main][INFO] - [train] Step 57900 out of 120000 | Loss --> 2.234 | Grad_l2 --> 0.384 | Weights_l2 
--> 44316.500 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 05:23:18,067][Main][INFO] - [train] Step 58000 out of 120000 | Loss --> 2.223 | Grad_l2 --> 0.380 | Weights_l2 --> 44349.952 | Lr --> 0.004 | Seconds_per_step --> 2.121 | [2024-04-23 05:26:49,478][Main][INFO] - [train] Step 58100 out of 120000 | Loss --> 2.233 | Grad_l2 --> 0.373 | Weights_l2 --> 44383.793 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 05:30:21,310][Main][INFO] - [train] Step 58200 out of 120000 | Loss --> 2.223 | Grad_l2 --> 0.384 | Weights_l2 --> 44417.088 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 05:33:52,769][Main][INFO] - [train] Step 58300 out of 120000 | Loss --> 2.196 | Grad_l2 --> 0.376 | Weights_l2 --> 44450.397 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 05:37:24,211][Main][INFO] - [train] Step 58400 out of 120000 | Loss --> 2.218 | Grad_l2 --> 0.383 | Weights_l2 --> 44483.879 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 05:40:56,647][Main][INFO] - [train] Step 58500 out of 120000 | Loss --> 2.209 | Grad_l2 --> 0.382 | Weights_l2 --> 44517.406 | Lr --> 0.004 | Seconds_per_step --> 2.124 | [2024-04-23 05:44:29,499][Main][INFO] - [train] Step 58600 out of 120000 | Loss --> 2.210 | Grad_l2 --> 0.378 | Weights_l2 --> 44551.142 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 05:48:02,749][Main][INFO] - [train] Step 58700 out of 120000 | Loss --> 2.213 | Grad_l2 --> 0.384 | Weights_l2 --> 44584.593 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 05:51:36,584][Main][INFO] - [train] Step 58800 out of 120000 | Loss --> 2.207 | Grad_l2 --> 0.373 | Weights_l2 --> 44617.630 | Lr --> 0.004 | Seconds_per_step --> 2.138 | [2024-04-23 05:55:06,549][Main][INFO] - [train] Step 58900 out of 120000 | Loss --> 2.210 | Grad_l2 --> 0.382 | Weights_l2 --> 44651.058 | Lr --> 0.004 | Seconds_per_step --> 2.100 | [2024-04-23 05:58:38,196][Main][INFO] - [train] Step 59000 out of 120000 | Loss --> 2.232 | Grad_l2 --> 0.377 
| Weights_l2 --> 44684.705 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 06:02:08,664][Main][INFO] - [train] Step 59100 out of 120000 | Loss --> 2.204 | Grad_l2 --> 0.382 | Weights_l2 --> 44718.038 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 06:05:40,939][Main][INFO] - [train] Step 59200 out of 120000 | Loss --> 2.201 | Grad_l2 --> 0.383 | Weights_l2 --> 44751.719 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 06:09:14,637][Main][INFO] - [train] Step 59300 out of 120000 | Loss --> 2.214 | Grad_l2 --> 0.384 | Weights_l2 --> 44785.038 | Lr --> 0.004 | Seconds_per_step --> 2.137 | [2024-04-23 06:12:43,664][Main][INFO] - [train] Step 59400 out of 120000 | Loss --> 2.188 | Grad_l2 --> 0.382 | Weights_l2 --> 44818.779 | Lr --> 0.004 | Seconds_per_step --> 2.090 | [2024-04-23 06:16:16,791][Main][INFO] - [train] Step 59500 out of 120000 | Loss --> 2.207 | Grad_l2 --> 0.386 | Weights_l2 --> 44852.475 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 06:19:49,843][Main][INFO] - [train] Step 59600 out of 120000 | Loss --> 2.209 | Grad_l2 --> 0.377 | Weights_l2 --> 44885.961 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 06:23:19,134][Main][INFO] - [train] Step 59700 out of 120000 | Loss --> 2.219 | Grad_l2 --> 0.384 | Weights_l2 --> 44919.278 | Lr --> 0.004 | Seconds_per_step --> 2.093 | [2024-04-23 06:26:53,152][Main][INFO] - [train] Step 59800 out of 120000 | Loss --> 2.228 | Grad_l2 --> 0.382 | Weights_l2 --> 44952.824 | Lr --> 0.004 | Seconds_per_step --> 2.140 | [2024-04-23 06:30:26,340][Main][INFO] - [train] Step 59900 out of 120000 | Loss --> 2.220 | Grad_l2 --> 0.387 | Weights_l2 --> 44986.187 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 06:33:57,038][Main][INFO] - [train] Step 60000 out of 120000 | Loss --> 2.232 | Grad_l2 --> 0.383 | Weights_l2 --> 45019.259 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 06:33:57,285][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 06:38:19,588][Main][INFO] - [eval] Step 60000 out of 120000 | Loss --> 2.077 | Accuracy --> 0.633 | Time --> 262.548 | [2024-04-23 06:38:19,592][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-60000 [2024-04-23 06:38:19,595][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-23 06:38:24,014][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-60000/model.safetensors [2024-04-23 06:38:24,067][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-60000/optimizer.bin [2024-04-23 06:38:24,069][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-60000/scheduler.bin [2024-04-23 06:38:24,069][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-60000/sampler.bin [2024-04-23 06:38:24,069][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-60000/sampler_1.bin [2024-04-23 06:38:24,070][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-60000/random_states_0.pkl [2024-04-23 06:41:56,639][Main][INFO] - [train] Step 60100 out of 120000 | Loss --> 2.207 | Grad_l2 --> 0.382 | Weights_l2 --> 45052.104 | Lr --> 0.004 | Seconds_per_step --> 2.170 | [2024-04-23 06:45:26,341][Main][INFO] - [train] Step 60200 out of 120000 | Loss --> 2.227 | Grad_l2 --> 0.384 | Weights_l2 --> 45086.307 | Lr --> 0.004 | Seconds_per_step --> 2.097 | [2024-04-23 06:49:01,150][Main][INFO] - [train] Step 60300 out of 120000 | Loss --> 2.240 | Grad_l2 --> 0.378 | Weights_l2 --> 45119.729 | Lr --> 0.004 | Seconds_per_step --> 2.148 | [2024-04-23 06:52:33,205][Main][INFO] - [train] Step 60400 out of 120000 | Loss --> 2.225 | Grad_l2 --> 0.387 | Weights_l2 --> 45153.328 | Lr --> 0.004 | 
Seconds_per_step --> 2.121 | [2024-04-23 06:56:04,802][Main][INFO] - [train] Step 60500 out of 120000 | Loss --> 2.226 | Grad_l2 --> 0.382 | Weights_l2 --> 45186.264 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 06:59:37,343][Main][INFO] - [train] Step 60600 out of 120000 | Loss --> 2.216 | Grad_l2 --> 0.378 | Weights_l2 --> 45219.348 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 07:03:06,040][Main][INFO] - [train] Step 60700 out of 120000 | Loss --> 2.223 | Grad_l2 --> 0.387 | Weights_l2 --> 45252.201 | Lr --> 0.004 | Seconds_per_step --> 2.087 | [2024-04-23 07:06:37,495][Main][INFO] - [train] Step 60800 out of 120000 | Loss --> 2.213 | Grad_l2 --> 0.374 | Weights_l2 --> 45285.361 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 07:10:09,965][Main][INFO] - [train] Step 60900 out of 120000 | Loss --> 2.224 | Grad_l2 --> 0.387 | Weights_l2 --> 45318.173 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 07:13:43,502][Main][INFO] - [train] Step 61000 out of 120000 | Loss --> 2.232 | Grad_l2 --> 0.389 | Weights_l2 --> 45351.218 | Lr --> 0.004 | Seconds_per_step --> 2.135 | [2024-04-23 07:17:16,872][Main][INFO] - [train] Step 61100 out of 120000 | Loss --> 2.226 | Grad_l2 --> 0.394 | Weights_l2 --> 45384.830 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 07:20:50,599][Main][INFO] - [train] Step 61200 out of 120000 | Loss --> 2.219 | Grad_l2 --> 0.389 | Weights_l2 --> 45417.947 | Lr --> 0.004 | Seconds_per_step --> 2.137 | [2024-04-23 07:24:23,674][Main][INFO] - [train] Step 61300 out of 120000 | Loss --> 2.207 | Grad_l2 --> 0.380 | Weights_l2 --> 45450.997 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 07:27:54,370][Main][INFO] - [train] Step 61400 out of 120000 | Loss --> 2.201 | Grad_l2 --> 0.378 | Weights_l2 --> 45483.697 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 07:31:26,837][Main][INFO] - [train] Step 61500 out of 120000 | Loss --> 2.200 | Grad_l2 --> 0.383 | Weights_l2 --> 45516.162 | 
Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 07:34:58,294][Main][INFO] - [train] Step 61600 out of 120000 | Loss --> 2.208 | Grad_l2 --> 0.385 | Weights_l2 --> 45548.640 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 07:38:29,590][Main][INFO] - [train] Step 61700 out of 120000 | Loss --> 2.200 | Grad_l2 --> 0.382 | Weights_l2 --> 45580.972 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 07:42:03,703][Main][INFO] - [train] Step 61800 out of 120000 | Loss --> 2.203 | Grad_l2 --> 0.383 | Weights_l2 --> 45613.520 | Lr --> 0.004 | Seconds_per_step --> 2.141 | [2024-04-23 07:45:34,238][Main][INFO] - [train] Step 61900 out of 120000 | Loss --> 2.214 | Grad_l2 --> 0.378 | Weights_l2 --> 45646.039 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 07:49:05,567][Main][INFO] - [train] Step 62000 out of 120000 | Loss --> 2.198 | Grad_l2 --> 0.377 | Weights_l2 --> 45678.528 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 07:52:37,878][Main][INFO] - [train] Step 62100 out of 120000 | Loss --> 2.210 | Grad_l2 --> 0.380 | Weights_l2 --> 45711.529 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 07:56:10,807][Main][INFO] - [train] Step 62200 out of 120000 | Loss --> 2.211 | Grad_l2 --> 0.386 | Weights_l2 --> 45744.171 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 07:59:44,087][Main][INFO] - [train] Step 62300 out of 120000 | Loss --> 2.203 | Grad_l2 --> 0.372 | Weights_l2 --> 45776.698 | Lr --> 0.004 | Seconds_per_step --> 2.133 | [2024-04-23 08:03:14,579][Main][INFO] - [train] Step 62400 out of 120000 | Loss --> 2.228 | Grad_l2 --> 0.375 | Weights_l2 --> 45809.148 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 08:06:48,938][Main][INFO] - [train] Step 62500 out of 120000 | Loss --> 2.229 | Grad_l2 --> 0.383 | Weights_l2 --> 45842.055 | Lr --> 0.004 | Seconds_per_step --> 2.144 | [2024-04-23 08:10:19,470][Main][INFO] - [train] Step 62600 out of 120000 | Loss --> 2.203 | Grad_l2 --> 0.381 | Weights_l2 
--> 45874.775 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 08:13:50,187][Main][INFO] - [train] Step 62700 out of 120000 | Loss --> 2.205 | Grad_l2 --> 0.376 | Weights_l2 --> 45907.035 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 08:17:25,038][Main][INFO] - [train] Step 62800 out of 120000 | Loss --> 2.190 | Grad_l2 --> 0.382 | Weights_l2 --> 45939.836 | Lr --> 0.004 | Seconds_per_step --> 2.149 | [2024-04-23 08:20:58,896][Main][INFO] - [train] Step 62900 out of 120000 | Loss --> 2.211 | Grad_l2 --> 0.380 | Weights_l2 --> 45972.244 | Lr --> 0.004 | Seconds_per_step --> 2.139 | [2024-04-23 08:24:29,941][Main][INFO] - [train] Step 63000 out of 120000 | Loss --> 2.219 | Grad_l2 --> 0.401 | Weights_l2 --> 46004.470 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 08:28:00,737][Main][INFO] - [train] Step 63100 out of 120000 | Loss --> 2.211 | Grad_l2 --> 0.384 | Weights_l2 --> 46037.360 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 08:31:31,938][Main][INFO] - [train] Step 63200 out of 120000 | Loss --> 2.203 | Grad_l2 --> 0.385 | Weights_l2 --> 46069.902 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 08:35:06,073][Main][INFO] - [train] Step 63300 out of 120000 | Loss --> 2.200 | Grad_l2 --> 0.388 | Weights_l2 --> 46102.444 | Lr --> 0.004 | Seconds_per_step --> 2.141 | [2024-04-23 08:38:40,284][Main][INFO] - [train] Step 63400 out of 120000 | Loss --> 2.206 | Grad_l2 --> 0.387 | Weights_l2 --> 46135.019 | Lr --> 0.004 | Seconds_per_step --> 2.142 | [2024-04-23 08:42:11,252][Main][INFO] - [train] Step 63500 out of 120000 | Loss --> 2.196 | Grad_l2 --> 0.383 | Weights_l2 --> 46167.332 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 08:45:47,576][Main][INFO] - [train] Step 63600 out of 120000 | Loss --> 2.195 | Grad_l2 --> 0.387 | Weights_l2 --> 46200.103 | Lr --> 0.004 | Seconds_per_step --> 2.163 | [2024-04-23 08:49:21,201][Main][INFO] - [train] Step 63700 out of 120000 | Loss --> 2.221 | Grad_l2 --> 0.382 
| Weights_l2 --> 46232.980 | Lr --> 0.004 | Seconds_per_step --> 2.136 | [2024-04-23 08:52:51,038][Main][INFO] - [train] Step 63800 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.383 | Weights_l2 --> 46265.834 | Lr --> 0.004 | Seconds_per_step --> 2.098 | [2024-04-23 08:56:23,368][Main][INFO] - [train] Step 63900 out of 120000 | Loss --> 2.183 | Grad_l2 --> 0.387 | Weights_l2 --> 46298.156 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 08:59:55,317][Main][INFO] - [train] Step 64000 out of 120000 | Loss --> 2.182 | Grad_l2 --> 0.397 | Weights_l2 --> 46330.771 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 09:03:27,645][Main][INFO] - [train] Step 64100 out of 120000 | Loss --> 2.189 | Grad_l2 --> 0.390 | Weights_l2 --> 46363.195 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 09:06:59,940][Main][INFO] - [train] Step 64200 out of 120000 | Loss --> 2.210 | Grad_l2 --> 0.381 | Weights_l2 --> 46395.601 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 09:10:31,127][Main][INFO] - [train] Step 64300 out of 120000 | Loss --> 2.192 | Grad_l2 --> 0.380 | Weights_l2 --> 46428.182 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 09:14:01,993][Main][INFO] - [train] Step 64400 out of 120000 | Loss --> 2.177 | Grad_l2 --> 0.389 | Weights_l2 --> 46460.505 | Lr --> 0.004 | Seconds_per_step --> 2.109 | [2024-04-23 09:17:34,065][Main][INFO] - [train] Step 64500 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.399 | Weights_l2 --> 46492.438 | Lr --> 0.004 | Seconds_per_step --> 2.121 | [2024-04-23 09:21:06,894][Main][INFO] - [train] Step 64600 out of 120000 | Loss --> 2.190 | Grad_l2 --> 0.396 | Weights_l2 --> 46524.923 | Lr --> 0.004 | Seconds_per_step --> 2.128 | [2024-04-23 09:24:38,213][Main][INFO] - [train] Step 64700 out of 120000 | Loss --> 2.195 | Grad_l2 --> 0.382 | Weights_l2 --> 46557.356 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 09:28:11,066][Main][INFO] - [train] Step 64800 out of 120000 | Loss --> 2.197 | 
Grad_l2 --> 0.385 | Weights_l2 --> 46589.798 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 09:31:43,327][Main][INFO] - [train] Step 64900 out of 120000 | Loss --> 2.188 | Grad_l2 --> 0.385 | Weights_l2 --> 46622.120 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 09:35:13,974][Main][INFO] - [train] Step 65000 out of 120000 | Loss --> 2.205 | Grad_l2 --> 0.381 | Weights_l2 --> 46654.803 | Lr --> 0.004 | Seconds_per_step --> 2.106 | [2024-04-23 09:35:14,213][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 09:39:37,687][Main][INFO] - [eval] Step 65000 out of 120000 | Loss --> 2.047 | Accuracy --> 0.638 | Time --> 263.709 | [2024-04-23 09:43:09,939][Main][INFO] - [train] Step 65100 out of 120000 | Loss --> 2.209 | Grad_l2 --> 0.394 | Weights_l2 --> 46687.030 | Lr --> 0.004 | Seconds_per_step --> 2.123 | [2024-04-23 09:46:40,867][Main][INFO] - [train] Step 65200 out of 120000 | Loss --> 2.205 | Grad_l2 --> 0.384 | Weights_l2 --> 46719.496 | Lr --> 0.004 | Seconds_per_step --> 2.109 | [2024-04-23 09:50:12,164][Main][INFO] - [train] Step 65300 out of 120000 | Loss --> 2.193 | Grad_l2 --> 0.386 | Weights_l2 --> 46751.355 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 09:53:44,667][Main][INFO] - [train] Step 65400 out of 120000 | Loss --> 2.206 | Grad_l2 --> 0.380 | Weights_l2 --> 46783.189 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 09:57:15,973][Main][INFO] - [train] Step 65500 out of 120000 | Loss --> 2.208 | Grad_l2 --> 0.390 | Weights_l2 --> 46815.327 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 10:00:47,638][Main][INFO] - [train] Step 65600 out of 120000 | Loss --> 2.190 | Grad_l2 --> 0.381 | Weights_l2 --> 46846.799 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 10:04:20,729][Main][INFO] - [train] Step 65700 out of 120000 | Loss --> 2.193 | Grad_l2 --> 0.380 | Weights_l2 --> 46878.213 | Lr --> 0.004 | 
Seconds_per_step --> 2.131 | [2024-04-23 10:07:50,264][Main][INFO] - [train] Step 65800 out of 120000 | Loss --> 2.193 | Grad_l2 --> 0.385 | Weights_l2 --> 46909.939 | Lr --> 0.004 | Seconds_per_step --> 2.095 | [2024-04-23 10:11:22,872][Main][INFO] - [train] Step 65900 out of 120000 | Loss --> 2.174 | Grad_l2 --> 0.390 | Weights_l2 --> 46941.353 | Lr --> 0.004 | Seconds_per_step --> 2.126 | [2024-04-23 10:14:59,178][Main][INFO] - [train] Step 66000 out of 120000 | Loss --> 2.187 | Grad_l2 --> 0.387 | Weights_l2 --> 46972.847 | Lr --> 0.004 | Seconds_per_step --> 2.163 | [2024-04-23 10:18:29,052][Main][INFO] - [train] Step 66100 out of 120000 | Loss --> 2.173 | Grad_l2 --> 0.385 | Weights_l2 --> 47004.480 | Lr --> 0.004 | Seconds_per_step --> 2.099 | [2024-04-23 10:22:02,602][Main][INFO] - [train] Step 66200 out of 120000 | Loss --> 2.183 | Grad_l2 --> 0.380 | Weights_l2 --> 47036.224 | Lr --> 0.004 | Seconds_per_step --> 2.135 | [2024-04-23 10:25:32,106][Main][INFO] - [train] Step 66300 out of 120000 | Loss --> 2.205 | Grad_l2 --> 0.384 | Weights_l2 --> 47067.595 | Lr --> 0.004 | Seconds_per_step --> 2.095 | [2024-04-23 10:29:03,687][Main][INFO] - [train] Step 66400 out of 120000 | Loss --> 2.186 | Grad_l2 --> 0.383 | Weights_l2 --> 47099.039 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 10:32:35,386][Main][INFO] - [train] Step 66500 out of 120000 | Loss --> 2.191 | Grad_l2 --> 0.392 | Weights_l2 --> 47130.552 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 10:36:05,038][Main][INFO] - [train] Step 66600 out of 120000 | Loss --> 2.183 | Grad_l2 --> 0.378 | Weights_l2 --> 47161.843 | Lr --> 0.004 | Seconds_per_step --> 2.097 | [2024-04-23 10:39:41,160][Main][INFO] - [train] Step 66700 out of 120000 | Loss --> 2.201 | Grad_l2 --> 0.385 | Weights_l2 --> 47193.036 | Lr --> 0.004 | Seconds_per_step --> 2.161 | [2024-04-23 10:43:10,298][Main][INFO] - [train] Step 66800 out of 120000 | Loss --> 2.196 | Grad_l2 --> 0.387 | Weights_l2 --> 47224.823 | 
Lr --> 0.004 | Seconds_per_step --> 2.091 | [2024-04-23 10:46:41,183][Main][INFO] - [train] Step 66900 out of 120000 | Loss --> 2.171 | Grad_l2 --> 0.396 | Weights_l2 --> 47256.332 | Lr --> 0.004 | Seconds_per_step --> 2.109 | [2024-04-23 10:50:15,367][Main][INFO] - [train] Step 67000 out of 120000 | Loss --> 2.164 | Grad_l2 --> 0.382 | Weights_l2 --> 47287.780 | Lr --> 0.004 | Seconds_per_step --> 2.142 | [2024-04-23 10:53:49,339][Main][INFO] - [train] Step 67100 out of 120000 | Loss --> 2.184 | Grad_l2 --> 0.383 | Weights_l2 --> 47318.824 | Lr --> 0.004 | Seconds_per_step --> 2.140 | [2024-04-23 10:57:21,996][Main][INFO] - [train] Step 67200 out of 120000 | Loss --> 2.182 | Grad_l2 --> 0.395 | Weights_l2 --> 47350.675 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 11:00:53,156][Main][INFO] - [train] Step 67300 out of 120000 | Loss --> 2.173 | Grad_l2 --> 0.384 | Weights_l2 --> 47382.395 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 11:04:28,673][Main][INFO] - [train] Step 67400 out of 120000 | Loss --> 2.168 | Grad_l2 --> 0.384 | Weights_l2 --> 47414.297 | Lr --> 0.004 | Seconds_per_step --> 2.155 | [2024-04-23 11:07:57,320][Main][INFO] - [train] Step 67500 out of 120000 | Loss --> 2.176 | Grad_l2 --> 0.389 | Weights_l2 --> 47445.783 | Lr --> 0.004 | Seconds_per_step --> 2.086 | [2024-04-23 11:11:28,797][Main][INFO] - [train] Step 67600 out of 120000 | Loss --> 2.174 | Grad_l2 --> 0.390 | Weights_l2 --> 47477.360 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 11:15:00,700][Main][INFO] - [train] Step 67700 out of 120000 | Loss --> 2.161 | Grad_l2 --> 0.371 | Weights_l2 --> 47509.462 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 11:18:33,992][Main][INFO] - [train] Step 67800 out of 120000 | Loss --> 2.158 | Grad_l2 --> 0.387 | Weights_l2 --> 47541.046 | Lr --> 0.004 | Seconds_per_step --> 2.133 | [2024-04-23 11:22:08,567][Main][INFO] - [train] Step 67900 out of 120000 | Loss --> 2.172 | Grad_l2 --> 0.389 | Weights_l2 
--> 47572.807 | Lr --> 0.004 | Seconds_per_step --> 2.146 | [2024-04-23 11:25:41,244][Main][INFO] - [train] Step 68000 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.385 | Weights_l2 --> 47604.404 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 11:29:13,039][Main][INFO] - [train] Step 68100 out of 120000 | Loss --> 2.180 | Grad_l2 --> 0.387 | Weights_l2 --> 47636.190 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 11:32:44,270][Main][INFO] - [train] Step 68200 out of 120000 | Loss --> 2.177 | Grad_l2 --> 0.384 | Weights_l2 --> 47667.473 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 11:36:15,849][Main][INFO] - [train] Step 68300 out of 120000 | Loss --> 2.172 | Grad_l2 --> 0.388 | Weights_l2 --> 47698.981 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 11:39:47,667][Main][INFO] - [train] Step 68400 out of 120000 | Loss --> 2.171 | Grad_l2 --> 0.399 | Weights_l2 --> 47730.324 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 11:43:19,265][Main][INFO] - [train] Step 68500 out of 120000 | Loss --> 2.182 | Grad_l2 --> 0.390 | Weights_l2 --> 47761.903 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 11:46:52,155][Main][INFO] - [train] Step 68600 out of 120000 | Loss --> 2.207 | Grad_l2 --> 0.393 | Weights_l2 --> 47793.013 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 11:50:22,938][Main][INFO] - [train] Step 68700 out of 120000 | Loss --> 2.184 | Grad_l2 --> 0.390 | Weights_l2 --> 47824.014 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 11:53:54,639][Main][INFO] - [train] Step 68800 out of 120000 | Loss --> 2.179 | Grad_l2 --> 0.387 | Weights_l2 --> 47854.702 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 11:57:23,838][Main][INFO] - [train] Step 68900 out of 120000 | Loss --> 2.188 | Grad_l2 --> 0.406 | Weights_l2 --> 47885.380 | Lr --> 0.004 | Seconds_per_step --> 2.092 | [2024-04-23 12:00:56,965][Main][INFO] - [train] Step 69000 out of 120000 | Loss --> 2.174 | Grad_l2 --> 0.379 
| Weights_l2 --> 47916.620 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 12:04:27,792][Main][INFO] - [train] Step 69100 out of 120000 | Loss --> 2.171 | Grad_l2 --> 0.389 | Weights_l2 --> 47948.167 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 12:08:02,037][Main][INFO] - [train] Step 69200 out of 120000 | Loss --> 2.165 | Grad_l2 --> 0.388 | Weights_l2 --> 47978.874 | Lr --> 0.004 | Seconds_per_step --> 2.142 | [2024-04-23 12:11:33,265][Main][INFO] - [train] Step 69300 out of 120000 | Loss --> 2.177 | Grad_l2 --> 0.392 | Weights_l2 --> 48009.920 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 12:15:04,994][Main][INFO] - [train] Step 69400 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.387 | Weights_l2 --> 48040.483 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 12:18:37,439][Main][INFO] - [train] Step 69500 out of 120000 | Loss --> 2.189 | Grad_l2 --> 0.386 | Weights_l2 --> 48071.393 | Lr --> 0.004 | Seconds_per_step --> 2.124 | [2024-04-23 12:22:07,668][Main][INFO] - [train] Step 69600 out of 120000 | Loss --> 2.196 | Grad_l2 --> 0.400 | Weights_l2 --> 48102.355 | Lr --> 0.004 | Seconds_per_step --> 2.102 | [2024-04-23 12:25:41,087][Main][INFO] - [train] Step 69700 out of 120000 | Loss --> 2.191 | Grad_l2 --> 0.395 | Weights_l2 --> 48132.979 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 12:29:14,667][Main][INFO] - [train] Step 69800 out of 120000 | Loss --> 2.182 | Grad_l2 --> 0.394 | Weights_l2 --> 48163.791 | Lr --> 0.004 | Seconds_per_step --> 2.136 | [2024-04-23 12:32:46,546][Main][INFO] - [train] Step 69900 out of 120000 | Loss --> 2.190 | Grad_l2 --> 0.384 | Weights_l2 --> 48194.733 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 12:36:19,761][Main][INFO] - [train] Step 70000 out of 120000 | Loss --> 2.168 | Grad_l2 --> 0.384 | Weights_l2 --> 48225.582 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 12:36:19,985][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 12:40:42,395][Main][INFO] - [eval] Step 70000 out of 120000 | Loss --> 2.021 | Accuracy --> 0.640 | Time --> 262.632 | [2024-04-23 12:40:42,399][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-70000 [2024-04-23 12:40:42,402][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-23 12:40:47,096][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-70000/model.safetensors [2024-04-23 12:40:47,150][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-70000/optimizer.bin [2024-04-23 12:40:47,151][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-70000/scheduler.bin [2024-04-23 12:40:47,151][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-70000/sampler.bin [2024-04-23 12:40:47,151][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-70000/sampler_1.bin [2024-04-23 12:40:47,152][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-70000/random_states_0.pkl [2024-04-23 12:44:15,837][Main][INFO] - [train] Step 70100 out of 120000 | Loss --> 2.169 | Grad_l2 --> 0.377 | Weights_l2 --> 48255.834 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 12:47:46,491][Main][INFO] - [train] Step 70200 out of 120000 | Loss --> 2.144 | Grad_l2 --> 0.379 | Weights_l2 --> 48286.304 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 12:51:20,979][Main][INFO] - [train] Step 70300 out of 120000 | Loss --> 2.153 | Grad_l2 --> 0.390 | Weights_l2 --> 48316.872 | Lr --> 0.004 | Seconds_per_step --> 2.145 | [2024-04-23 12:54:52,203][Main][INFO] - [train] Step 70400 out of 120000 | Loss --> 2.178 | Grad_l2 --> 0.418 | Weights_l2 --> 48347.304 | Lr --> 0.004 | 
Seconds_per_step --> 2.112 | [2024-04-23 12:58:27,177][Main][INFO] - [train] Step 70500 out of 120000 | Loss --> 2.175 | Grad_l2 --> 0.385 | Weights_l2 --> 48377.926 | Lr --> 0.004 | Seconds_per_step --> 2.150 | [2024-04-23 13:01:56,037][Main][INFO] - [train] Step 70600 out of 120000 | Loss --> 2.182 | Grad_l2 --> 0.396 | Weights_l2 --> 48408.624 | Lr --> 0.004 | Seconds_per_step --> 2.089 | [2024-04-23 13:05:30,738][Main][INFO] - [train] Step 70700 out of 120000 | Loss --> 2.166 | Grad_l2 --> 0.384 | Weights_l2 --> 48438.915 | Lr --> 0.004 | Seconds_per_step --> 2.147 | [2024-04-23 13:09:01,146][Main][INFO] - [train] Step 70800 out of 120000 | Loss --> 2.169 | Grad_l2 --> 0.389 | Weights_l2 --> 48469.532 | Lr --> 0.004 | Seconds_per_step --> 2.104 | [2024-04-23 13:12:33,851][Main][INFO] - [train] Step 70900 out of 120000 | Loss --> 2.180 | Grad_l2 --> 0.404 | Weights_l2 --> 48500.078 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 13:16:09,341][Main][INFO] - [train] Step 71000 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.410 | Weights_l2 --> 48530.661 | Lr --> 0.004 | Seconds_per_step --> 2.155 | [2024-04-23 13:19:36,041][Main][INFO] - [train] Step 71100 out of 120000 | Loss --> 2.164 | Grad_l2 --> 0.395 | Weights_l2 --> 48561.446 | Lr --> 0.004 | Seconds_per_step --> 2.067 | [2024-04-23 13:23:07,439][Main][INFO] - [train] Step 71200 out of 120000 | Loss --> 2.175 | Grad_l2 --> 0.410 | Weights_l2 --> 48592.025 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 13:26:42,892][Main][INFO] - [train] Step 71300 out of 120000 | Loss --> 2.181 | Grad_l2 --> 0.407 | Weights_l2 --> 48622.396 | Lr --> 0.004 | Seconds_per_step --> 2.155 | [2024-04-23 13:30:10,638][Main][INFO] - [train] Step 71400 out of 120000 | Loss --> 2.188 | Grad_l2 --> 0.389 | Weights_l2 --> 48652.660 | Lr --> 0.004 | Seconds_per_step --> 2.077 | [2024-04-23 13:33:43,701][Main][INFO] - [train] Step 71500 out of 120000 | Loss --> 2.191 | Grad_l2 --> 0.389 | Weights_l2 --> 48682.788 | 
Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 13:37:18,303][Main][INFO] - [train] Step 71600 out of 120000 | Loss --> 2.187 | Grad_l2 --> 0.398 | Weights_l2 --> 48712.992 | Lr --> 0.004 | Seconds_per_step --> 2.146 | [2024-04-23 13:40:50,098][Main][INFO] - [train] Step 71700 out of 120000 | Loss --> 2.177 | Grad_l2 --> 0.384 | Weights_l2 --> 48743.447 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 13:44:24,039][Main][INFO] - [train] Step 71800 out of 120000 | Loss --> 2.177 | Grad_l2 --> 0.390 | Weights_l2 --> 48773.458 | Lr --> 0.004 | Seconds_per_step --> 2.139 | [2024-04-23 13:47:57,394][Main][INFO] - [train] Step 71900 out of 120000 | Loss --> 2.168 | Grad_l2 --> 0.394 | Weights_l2 --> 48803.696 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 13:51:27,979][Main][INFO] - [train] Step 72000 out of 120000 | Loss --> 2.152 | Grad_l2 --> 0.392 | Weights_l2 --> 48833.767 | Lr --> 0.004 | Seconds_per_step --> 2.106 | [2024-04-23 13:54:57,537][Main][INFO] - [train] Step 72100 out of 120000 | Loss --> 2.161 | Grad_l2 --> 0.391 | Weights_l2 --> 48864.719 | Lr --> 0.004 | Seconds_per_step --> 2.096 | [2024-04-23 13:58:28,082][Main][INFO] - [train] Step 72200 out of 120000 | Loss --> 2.164 | Grad_l2 --> 0.390 | Weights_l2 --> 48895.076 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 14:01:59,143][Main][INFO] - [train] Step 72300 out of 120000 | Loss --> 2.156 | Grad_l2 --> 0.397 | Weights_l2 --> 48925.062 | Lr --> 0.004 | Seconds_per_step --> 2.111 | [2024-04-23 14:05:31,076][Main][INFO] - [train] Step 72400 out of 120000 | Loss --> 2.158 | Grad_l2 --> 0.382 | Weights_l2 --> 48955.089 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 14:09:03,966][Main][INFO] - [train] Step 72500 out of 120000 | Loss --> 2.161 | Grad_l2 --> 0.397 | Weights_l2 --> 48985.020 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 14:12:35,137][Main][INFO] - [train] Step 72600 out of 120000 | Loss --> 2.171 | Grad_l2 --> 0.386 | Weights_l2 
--> 49014.992 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 14:16:04,904][Main][INFO] - [train] Step 72700 out of 120000 | Loss --> 2.160 | Grad_l2 --> 0.394 | Weights_l2 --> 49045.107 | Lr --> 0.004 | Seconds_per_step --> 2.098 | [2024-04-23 14:19:35,766][Main][INFO] - [train] Step 72800 out of 120000 | Loss --> 2.161 | Grad_l2 --> 0.388 | Weights_l2 --> 49075.003 | Lr --> 0.004 | Seconds_per_step --> 2.109 | [2024-04-23 14:23:07,368][Main][INFO] - [train] Step 72900 out of 120000 | Loss --> 2.157 | Grad_l2 --> 0.387 | Weights_l2 --> 49105.174 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 14:26:38,536][Main][INFO] - [train] Step 73000 out of 120000 | Loss --> 2.162 | Grad_l2 --> 0.397 | Weights_l2 --> 49135.373 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 14:30:08,476][Main][INFO] - [train] Step 73100 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.387 | Weights_l2 --> 49165.791 | Lr --> 0.004 | Seconds_per_step --> 2.099 | [2024-04-23 14:33:40,567][Main][INFO] - [train] Step 73200 out of 120000 | Loss --> 2.155 | Grad_l2 --> 0.385 | Weights_l2 --> 49196.107 | Lr --> 0.004 | Seconds_per_step --> 2.121 | [2024-04-23 14:37:12,092][Main][INFO] - [train] Step 73300 out of 120000 | Loss --> 2.146 | Grad_l2 --> 0.392 | Weights_l2 --> 49225.876 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 14:40:45,267][Main][INFO] - [train] Step 73400 out of 120000 | Loss --> 2.173 | Grad_l2 --> 0.391 | Weights_l2 --> 49256.388 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 14:44:16,596][Main][INFO] - [train] Step 73500 out of 120000 | Loss --> 2.162 | Grad_l2 --> 0.385 | Weights_l2 --> 49285.956 | Lr --> 0.004 | Seconds_per_step --> 2.113 | [2024-04-23 14:47:46,997][Main][INFO] - [train] Step 73600 out of 120000 | Loss --> 2.154 | Grad_l2 --> 0.388 | Weights_l2 --> 49315.738 | Lr --> 0.004 | Seconds_per_step --> 2.104 | [2024-04-23 14:51:18,547][Main][INFO] - [train] Step 73700 out of 120000 | Loss --> 2.150 | Grad_l2 --> 0.388 
| Weights_l2 --> 49345.740 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 14:54:52,144][Main][INFO] - [train] Step 73800 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.390 | Weights_l2 --> 49375.993 | Lr --> 0.004 | Seconds_per_step --> 2.136 | [2024-04-23 14:58:23,567][Main][INFO] - [train] Step 73900 out of 120000 | Loss --> 2.153 | Grad_l2 --> 0.387 | Weights_l2 --> 49406.041 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 15:01:54,366][Main][INFO] - [train] Step 74000 out of 120000 | Loss --> 2.139 | Grad_l2 --> 0.389 | Weights_l2 --> 49436.062 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 15:05:29,568][Main][INFO] - [train] Step 74100 out of 120000 | Loss --> 2.158 | Grad_l2 --> 0.378 | Weights_l2 --> 49466.066 | Lr --> 0.004 | Seconds_per_step --> 2.152 | [2024-04-23 15:08:59,895][Main][INFO] - [train] Step 74200 out of 120000 | Loss --> 2.165 | Grad_l2 --> 0.394 | Weights_l2 --> 49496.429 | Lr --> 0.004 | Seconds_per_step --> 2.103 | [2024-04-23 15:12:35,039][Main][INFO] - [train] Step 74300 out of 120000 | Loss --> 2.146 | Grad_l2 --> 0.393 | Weights_l2 --> 49526.500 | Lr --> 0.004 | Seconds_per_step --> 2.151 | [2024-04-23 15:16:05,237][Main][INFO] - [train] Step 74400 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.389 | Weights_l2 --> 49556.161 | Lr --> 0.004 | Seconds_per_step --> 2.102 | [2024-04-23 15:19:37,099][Main][INFO] - [train] Step 74500 out of 120000 | Loss --> 2.143 | Grad_l2 --> 0.403 | Weights_l2 --> 49586.082 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 15:23:08,564][Main][INFO] - [train] Step 74600 out of 120000 | Loss --> 2.152 | Grad_l2 --> 0.391 | Weights_l2 --> 49616.120 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 15:26:39,187][Main][INFO] - [train] Step 74700 out of 120000 | Loss --> 2.141 | Grad_l2 --> 0.388 | Weights_l2 --> 49645.908 | Lr --> 0.004 | Seconds_per_step --> 2.106 | [2024-04-23 15:30:09,206][Main][INFO] - [train] Step 74800 out of 120000 | Loss --> 2.146 | 
Grad_l2 --> 0.383 | Weights_l2 --> 49675.306 | Lr --> 0.004 | Seconds_per_step --> 2.100 | [2024-04-23 15:33:39,996][Main][INFO] - [train] Step 74900 out of 120000 | Loss --> 2.148 | Grad_l2 --> 0.399 | Weights_l2 --> 49705.091 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 15:37:10,368][Main][INFO] - [train] Step 75000 out of 120000 | Loss --> 2.160 | Grad_l2 --> 0.381 | Weights_l2 --> 49734.718 | Lr --> 0.004 | Seconds_per_step --> 2.104 | [2024-04-23 15:37:10,596][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 15:41:34,808][Main][INFO] - [eval] Step 75000 out of 120000 | Loss --> 1.997 | Accuracy --> 0.644 | Time --> 264.438 | [2024-04-23 15:45:05,893][Main][INFO] - [train] Step 75100 out of 120000 | Loss --> 2.150 | Grad_l2 --> 0.394 | Weights_l2 --> 49764.537 | Lr --> 0.004 | Seconds_per_step --> 2.111 | [2024-04-23 15:48:38,741][Main][INFO] - [train] Step 75200 out of 120000 | Loss --> 2.165 | Grad_l2 --> 0.388 | Weights_l2 --> 49794.561 | Lr --> 0.004 | Seconds_per_step --> 2.128 | [2024-04-23 15:52:10,948][Main][INFO] - [train] Step 75300 out of 120000 | Loss --> 2.164 | Grad_l2 --> 0.396 | Weights_l2 --> 49824.512 | Lr --> 0.004 | Seconds_per_step --> 2.122 | [2024-04-23 15:55:44,068][Main][INFO] - [train] Step 75400 out of 120000 | Loss --> 2.153 | Grad_l2 --> 0.400 | Weights_l2 --> 49853.889 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 15:59:16,436][Main][INFO] - [train] Step 75500 out of 120000 | Loss --> 2.168 | Grad_l2 --> 0.388 | Weights_l2 --> 49884.072 | Lr --> 0.004 | Seconds_per_step --> 2.124 | [2024-04-23 16:02:47,939][Main][INFO] - [train] Step 75600 out of 120000 | Loss --> 2.161 | Grad_l2 --> 0.391 | Weights_l2 --> 49913.632 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 16:06:19,700][Main][INFO] - [train] Step 75700 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.388 | Weights_l2 --> 49943.037 | Lr --> 0.004 | 
Seconds_per_step --> 2.118 | [2024-04-23 16:09:52,820][Main][INFO] - [train] Step 75800 out of 120000 | Loss --> 2.131 | Grad_l2 --> 0.381 | Weights_l2 --> 49972.559 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 16:13:24,599][Main][INFO] - [train] Step 75900 out of 120000 | Loss --> 2.142 | Grad_l2 --> 0.396 | Weights_l2 --> 50002.472 | Lr --> 0.004 | Seconds_per_step --> 2.118 | [2024-04-23 16:16:58,038][Main][INFO] - [train] Step 76000 out of 120000 | Loss --> 2.153 | Grad_l2 --> 0.383 | Weights_l2 --> 50032.049 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 16:20:28,436][Main][INFO] - [train] Step 76100 out of 120000 | Loss --> 2.157 | Grad_l2 --> 0.391 | Weights_l2 --> 50061.742 | Lr --> 0.004 | Seconds_per_step --> 2.104 | [2024-04-23 16:24:01,182][Main][INFO] - [train] Step 76200 out of 120000 | Loss --> 2.156 | Grad_l2 --> 0.385 | Weights_l2 --> 50091.375 | Lr --> 0.004 | Seconds_per_step --> 2.127 | [2024-04-23 16:27:32,905][Main][INFO] - [train] Step 76300 out of 120000 | Loss --> 2.146 | Grad_l2 --> 0.388 | Weights_l2 --> 50120.625 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 16:31:01,247][Main][INFO] - [train] Step 76400 out of 120000 | Loss --> 2.154 | Grad_l2 --> 0.378 | Weights_l2 --> 50149.948 | Lr --> 0.004 | Seconds_per_step --> 2.083 | [2024-04-23 16:34:35,593][Main][INFO] - [train] Step 76500 out of 120000 | Loss --> 2.151 | Grad_l2 --> 0.414 | Weights_l2 --> 50179.306 | Lr --> 0.004 | Seconds_per_step --> 2.143 | [2024-04-23 16:38:03,584][Main][INFO] - [train] Step 76600 out of 120000 | Loss --> 2.146 | Grad_l2 --> 0.393 | Weights_l2 --> 50208.887 | Lr --> 0.004 | Seconds_per_step --> 2.080 | [2024-04-23 16:41:35,492][Main][INFO] - [train] Step 76700 out of 120000 | Loss --> 2.139 | Grad_l2 --> 0.389 | Weights_l2 --> 50237.967 | Lr --> 0.004 | Seconds_per_step --> 2.119 | [2024-04-23 16:45:09,418][Main][INFO] - [train] Step 76800 out of 120000 | Loss --> 2.159 | Grad_l2 --> 0.395 | Weights_l2 --> 50267.136 | 
Lr --> 0.004 | Seconds_per_step --> 2.139 | [2024-04-23 16:48:40,784][Main][INFO] - [train] Step 76900 out of 120000 | Loss --> 2.156 | Grad_l2 --> 0.384 | Weights_l2 --> 50297.254 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 16:52:13,376][Main][INFO] - [train] Step 77000 out of 120000 | Loss --> 2.154 | Grad_l2 --> 0.385 | Weights_l2 --> 50326.774 | Lr --> 0.004 | Seconds_per_step --> 2.126 | [2024-04-23 16:55:44,796][Main][INFO] - [train] Step 77100 out of 120000 | Loss --> 2.159 | Grad_l2 --> 0.396 | Weights_l2 --> 50356.262 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 16:59:18,839][Main][INFO] - [train] Step 77200 out of 120000 | Loss --> 2.159 | Grad_l2 --> 0.394 | Weights_l2 --> 50385.547 | Lr --> 0.004 | Seconds_per_step --> 2.140 | [2024-04-23 17:02:50,038][Main][INFO] - [train] Step 77300 out of 120000 | Loss --> 2.157 | Grad_l2 --> 0.399 | Weights_l2 --> 50415.002 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 17:06:22,845][Main][INFO] - [train] Step 77400 out of 120000 | Loss --> 2.152 | Grad_l2 --> 0.400 | Weights_l2 --> 50444.592 | Lr --> 0.004 | Seconds_per_step --> 2.128 | [2024-04-23 17:09:56,240][Main][INFO] - [train] Step 77500 out of 120000 | Loss --> 2.155 | Grad_l2 --> 0.391 | Weights_l2 --> 50474.060 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 17:13:24,841][Main][INFO] - [train] Step 77600 out of 120000 | Loss --> 2.151 | Grad_l2 --> 0.395 | Weights_l2 --> 50503.372 | Lr --> 0.004 | Seconds_per_step --> 2.086 | [2024-04-23 17:17:00,010][Main][INFO] - [train] Step 77700 out of 120000 | Loss --> 2.159 | Grad_l2 --> 0.392 | Weights_l2 --> 50532.687 | Lr --> 0.004 | Seconds_per_step --> 2.151 | [2024-04-23 17:20:31,738][Main][INFO] - [train] Step 77800 out of 120000 | Loss --> 2.152 | Grad_l2 --> 0.397 | Weights_l2 --> 50561.892 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 17:24:07,538][Main][INFO] - [train] Step 77900 out of 120000 | Loss --> 2.147 | Grad_l2 --> 0.384 | Weights_l2 
--> 50591.612 | Lr --> 0.004 | Seconds_per_step --> 2.158 | [2024-04-23 17:27:39,236][Main][INFO] - [train] Step 78000 out of 120000 | Loss --> 2.151 | Grad_l2 --> 0.383 | Weights_l2 --> 50620.713 | Lr --> 0.004 | Seconds_per_step --> 2.117 | [2024-04-23 17:31:10,686][Main][INFO] - [train] Step 78100 out of 120000 | Loss --> 2.139 | Grad_l2 --> 0.390 | Weights_l2 --> 50650.057 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 17:34:43,738][Main][INFO] - [train] Step 78200 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.412 | Weights_l2 --> 50679.249 | Lr --> 0.004 | Seconds_per_step --> 2.131 | [2024-04-23 17:38:14,339][Main][INFO] - [train] Step 78300 out of 120000 | Loss --> 2.128 | Grad_l2 --> 0.395 | Weights_l2 --> 50708.876 | Lr --> 0.004 | Seconds_per_step --> 2.106 | [2024-04-23 17:41:45,785][Main][INFO] - [train] Step 78400 out of 120000 | Loss --> 2.132 | Grad_l2 --> 0.386 | Weights_l2 --> 50738.177 | Lr --> 0.004 | Seconds_per_step --> 2.114 | [2024-04-23 17:45:17,339][Main][INFO] - [train] Step 78500 out of 120000 | Loss --> 2.118 | Grad_l2 --> 0.395 | Weights_l2 --> 50767.414 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 17:48:50,280][Main][INFO] - [train] Step 78600 out of 120000 | Loss --> 2.130 | Grad_l2 --> 0.409 | Weights_l2 --> 50796.347 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 17:52:22,284][Main][INFO] - [train] Step 78700 out of 120000 | Loss --> 2.103 | Grad_l2 --> 0.380 | Weights_l2 --> 50825.326 | Lr --> 0.004 | Seconds_per_step --> 2.120 | [2024-04-23 17:55:51,396][Main][INFO] - [train] Step 78800 out of 120000 | Loss --> 2.106 | Grad_l2 --> 0.390 | Weights_l2 --> 50854.551 | Lr --> 0.004 | Seconds_per_step --> 2.091 | [2024-04-23 17:59:21,938][Main][INFO] - [train] Step 78900 out of 120000 | Loss --> 2.104 | Grad_l2 --> 0.400 | Weights_l2 --> 50883.766 | Lr --> 0.004 | Seconds_per_step --> 2.105 | [2024-04-23 18:02:53,039][Main][INFO] - [train] Step 79000 out of 120000 | Loss --> 2.111 | Grad_l2 --> 0.386 
| Weights_l2 --> 50912.256 | Lr --> 0.004 | Seconds_per_step --> 2.111 | [2024-04-23 18:06:24,566][Main][INFO] - [train] Step 79100 out of 120000 | Loss --> 2.110 | Grad_l2 --> 0.380 | Weights_l2 --> 50941.239 | Lr --> 0.004 | Seconds_per_step --> 2.115 | [2024-04-23 18:09:57,764][Main][INFO] - [train] Step 79200 out of 120000 | Loss --> 2.120 | Grad_l2 --> 0.403 | Weights_l2 --> 50969.847 | Lr --> 0.004 | Seconds_per_step --> 2.132 | [2024-04-23 18:13:31,443][Main][INFO] - [train] Step 79300 out of 120000 | Loss --> 2.118 | Grad_l2 --> 0.398 | Weights_l2 --> 50998.570 | Lr --> 0.004 | Seconds_per_step --> 2.137 | [2024-04-23 18:17:02,181][Main][INFO] - [train] Step 79400 out of 120000 | Loss --> 2.111 | Grad_l2 --> 0.389 | Weights_l2 --> 51027.513 | Lr --> 0.004 | Seconds_per_step --> 2.107 | [2024-04-23 18:20:30,938][Main][INFO] - [train] Step 79500 out of 120000 | Loss --> 2.128 | Grad_l2 --> 0.395 | Weights_l2 --> 51056.616 | Lr --> 0.004 | Seconds_per_step --> 2.088 | [2024-04-23 18:24:04,866][Main][INFO] - [train] Step 79600 out of 120000 | Loss --> 2.136 | Grad_l2 --> 0.392 | Weights_l2 --> 51085.584 | Lr --> 0.004 | Seconds_per_step --> 2.139 | [2024-04-23 18:27:36,907][Main][INFO] - [train] Step 79700 out of 120000 | Loss --> 2.139 | Grad_l2 --> 0.395 | Weights_l2 --> 51114.438 | Lr --> 0.004 | Seconds_per_step --> 2.120 | [2024-04-23 18:31:07,668][Main][INFO] - [train] Step 79800 out of 120000 | Loss --> 2.121 | Grad_l2 --> 0.388 | Weights_l2 --> 51143.372 | Lr --> 0.004 | Seconds_per_step --> 2.108 | [2024-04-23 18:34:41,085][Main][INFO] - [train] Step 79900 out of 120000 | Loss --> 2.146 | Grad_l2 --> 0.399 | Weights_l2 --> 51172.301 | Lr --> 0.004 | Seconds_per_step --> 2.134 | [2024-04-23 18:38:12,100][Main][INFO] - [train] Step 80000 out of 120000 | Loss --> 2.098 | Grad_l2 --> 0.396 | Weights_l2 --> 51201.345 | Lr --> 0.004 | Seconds_per_step --> 2.110 | [2024-04-23 18:38:12,508][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 18:42:36,494][Main][INFO] - [eval] Step 80000 out of 120000 | Loss --> 1.979 | Accuracy --> 0.646 | Time --> 264.392 | [2024-04-23 18:42:36,498][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-80000 [2024-04-23 18:42:36,501][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-23 18:42:41,217][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-80000/model.safetensors [2024-04-23 18:42:41,268][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-80000/optimizer.bin [2024-04-23 18:42:41,269][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-80000/scheduler.bin [2024-04-23 18:42:41,269][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-80000/sampler.bin [2024-04-23 18:42:41,269][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-80000/sampler_1.bin [2024-04-23 18:42:41,271][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-80000/random_states_0.pkl [2024-04-23 18:46:13,785][Main][INFO] - [train] Step 80100 out of 120000 | Loss --> 2.116 | Grad_l2 --> 0.398 | Weights_l2 --> 51229.899 | Lr --> 0.004 | Seconds_per_step --> 2.173 | [2024-04-23 18:49:45,366][Main][INFO] - [train] Step 80200 out of 120000 | Loss --> 2.120 | Grad_l2 --> 0.385 | Weights_l2 --> 51258.525 | Lr --> 0.004 | Seconds_per_step --> 2.116 | [2024-04-23 18:53:15,682][Main][INFO] - [train] Step 80300 out of 120000 | Loss --> 2.117 | Grad_l2 --> 0.391 | Weights_l2 --> 51287.117 | Lr --> 0.004 | Seconds_per_step --> 2.103 | [2024-04-23 18:56:47,998][Main][INFO] - [train] Step 80400 out of 120000 | Loss --> 2.120 | Grad_l2 --> 0.400 | Weights_l2 --> 51315.979 | Lr --> 0.004 | 
Seconds_per_step --> 2.123 | [2024-04-23 19:00:17,194][Main][INFO] - [train] Step 80500 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.452 | Weights_l2 --> 51344.650 | Lr --> 0.004 | Seconds_per_step --> 2.092 | [2024-04-23 19:03:51,393][Main][INFO] - [train] Step 80600 out of 120000 | Loss --> 2.135 | Grad_l2 --> 0.396 | Weights_l2 --> 51373.149 | Lr --> 0.004 | Seconds_per_step --> 2.142 | [2024-04-23 19:07:20,144][Main][INFO] - [train] Step 80700 out of 120000 | Loss --> 2.118 | Grad_l2 --> 0.393 | Weights_l2 --> 51401.946 | Lr --> 0.004 | Seconds_per_step --> 2.088 | [2024-04-23 19:10:51,238][Main][INFO] - [train] Step 80800 out of 120000 | Loss --> 2.124 | Grad_l2 --> 0.391 | Weights_l2 --> 51430.781 | Lr --> 0.004 | Seconds_per_step --> 2.111 | [2024-04-23 19:14:25,293][Main][INFO] - [train] Step 80900 out of 120000 | Loss --> 2.130 | Grad_l2 --> 0.392 | Weights_l2 --> 51459.423 | Lr --> 0.004 | Seconds_per_step --> 2.141 | [2024-04-23 19:17:53,910][Main][INFO] - [train] Step 81000 out of 120000 | Loss --> 2.119 | Grad_l2 --> 0.397 | Weights_l2 --> 51488.321 | Lr --> 0.004 | Seconds_per_step --> 2.086 | [2024-04-23 19:21:26,802][Main][INFO] - [train] Step 81100 out of 120000 | Loss --> 2.127 | Grad_l2 --> 0.389 | Weights_l2 --> 51516.638 | Lr --> 0.004 | Seconds_per_step --> 2.129 | [2024-04-23 19:24:59,277][Main][INFO] - [train] Step 81200 out of 120000 | Loss --> 2.109 | Grad_l2 --> 0.391 | Weights_l2 --> 51544.674 | Lr --> 0.004 | Seconds_per_step --> 2.125 | [2024-04-23 19:28:29,669][Main][INFO] - [train] Step 81300 out of 120000 | Loss --> 2.114 | Grad_l2 --> 0.392 | Weights_l2 --> 51573.397 | Lr --> 0.004 | Seconds_per_step --> 2.104 | [2024-04-23 19:32:05,375][Main][INFO] - [train] Step 81400 out of 120000 | Loss --> 2.121 | Grad_l2 --> 0.395 | Weights_l2 --> 51601.348 | Lr --> 0.004 | Seconds_per_step --> 2.157 | [2024-04-23 19:35:32,667][Main][INFO] - [train] Step 81500 out of 120000 | Loss --> 2.113 | Grad_l2 --> 0.391 | Weights_l2 --> 51629.636 | 
Lr --> 0.004 | Seconds_per_step --> 2.073 | [2024-04-23 19:39:03,839][Main][INFO] - [train] Step 81600 out of 120000 | Loss --> 2.122 | Grad_l2 --> 0.389 | Weights_l2 --> 51657.988 | Lr --> 0.004 | Seconds_per_step --> 2.112 | [2024-04-23 19:42:35,850][Main][INFO] - [train] Step 81700 out of 120000 | Loss --> 2.129 | Grad_l2 --> 0.385 | Weights_l2 --> 51685.750 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-23 19:46:08,381][Main][INFO] - [train] Step 81800 out of 120000 | Loss --> 2.140 | Grad_l2 --> 0.397 | Weights_l2 --> 51713.888 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-23 19:49:40,299][Main][INFO] - [train] Step 81900 out of 120000 | Loss --> 2.122 | Grad_l2 --> 0.386 | Weights_l2 --> 51741.796 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-23 19:53:14,101][Main][INFO] - [train] Step 82000 out of 120000 | Loss --> 2.130 | Grad_l2 --> 0.386 | Weights_l2 --> 51769.768 | Lr --> 0.003 | Seconds_per_step --> 2.138 | [2024-04-23 19:56:46,283][Main][INFO] - [train] Step 82100 out of 120000 | Loss --> 2.116 | Grad_l2 --> 0.396 | Weights_l2 --> 51797.982 | Lr --> 0.003 | Seconds_per_step --> 2.122 | [2024-04-23 20:00:19,100][Main][INFO] - [train] Step 82200 out of 120000 | Loss --> 2.136 | Grad_l2 --> 0.403 | Weights_l2 --> 51826.243 | Lr --> 0.003 | Seconds_per_step --> 2.128 | [2024-04-23 20:03:51,101][Main][INFO] - [train] Step 82300 out of 120000 | Loss --> 2.149 | Grad_l2 --> 0.395 | Weights_l2 --> 51854.724 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-23 20:07:22,966][Main][INFO] - [train] Step 82400 out of 120000 | Loss --> 2.139 | Grad_l2 --> 0.402 | Weights_l2 --> 51882.828 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-23 20:10:53,937][Main][INFO] - [train] Step 82500 out of 120000 | Loss --> 2.135 | Grad_l2 --> 0.398 | Weights_l2 --> 51910.625 | Lr --> 0.003 | Seconds_per_step --> 2.110 | [2024-04-23 20:14:26,281][Main][INFO] - [train] Step 82600 out of 120000 | Loss --> 2.125 | Grad_l2 --> 0.395 | Weights_l2 
--> 51938.968 | Lr --> 0.003 | Seconds_per_step --> 2.123 | [2024-04-23 20:17:58,270][Main][INFO] - [train] Step 82700 out of 120000 | Loss --> 2.134 | Grad_l2 --> 0.402 | Weights_l2 --> 51966.711 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-23 20:21:29,402][Main][INFO] - [train] Step 82800 out of 120000 | Loss --> 2.132 | Grad_l2 --> 0.389 | Weights_l2 --> 51994.536 | Lr --> 0.003 | Seconds_per_step --> 2.111 | [2024-04-23 20:25:03,195][Main][INFO] - [train] Step 82900 out of 120000 | Loss --> 2.128 | Grad_l2 --> 0.392 | Weights_l2 --> 52022.429 | Lr --> 0.003 | Seconds_per_step --> 2.138 | [2024-04-23 20:28:34,397][Main][INFO] - [train] Step 83000 out of 120000 | Loss --> 2.115 | Grad_l2 --> 0.398 | Weights_l2 --> 52050.675 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-23 20:32:01,265][Main][INFO] - [train] Step 83100 out of 120000 | Loss --> 2.126 | Grad_l2 --> 0.401 | Weights_l2 --> 52078.531 | Lr --> 0.003 | Seconds_per_step --> 2.069 | [2024-04-23 20:35:32,839][Main][INFO] - [train] Step 83200 out of 120000 | Loss --> 2.118 | Grad_l2 --> 0.397 | Weights_l2 --> 52106.509 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-23 20:39:06,739][Main][INFO] - [train] Step 83300 out of 120000 | Loss --> 2.120 | Grad_l2 --> 0.393 | Weights_l2 --> 52134.486 | Lr --> 0.003 | Seconds_per_step --> 2.139 | [2024-04-23 20:42:40,541][Main][INFO] - [train] Step 83400 out of 120000 | Loss --> 2.126 | Grad_l2 --> 0.432 | Weights_l2 --> 52162.543 | Lr --> 0.003 | Seconds_per_step --> 2.138 | [2024-04-23 20:46:11,094][Main][INFO] - [train] Step 83500 out of 120000 | Loss --> 2.095 | Grad_l2 --> 0.403 | Weights_l2 --> 52190.017 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-23 20:49:44,569][Main][INFO] - [train] Step 83600 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.389 | Weights_l2 --> 52217.817 | Lr --> 0.003 | Seconds_per_step --> 2.135 | [2024-04-23 20:53:15,839][Main][INFO] - [train] Step 83700 out of 120000 | Loss --> 2.112 | Grad_l2 --> 0.404 
| Weights_l2 --> 52245.747 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-23 20:56:46,543][Main][INFO] - [train] Step 83800 out of 120000 | Loss --> 2.107 | Grad_l2 --> 0.394 | Weights_l2 --> 52274.061 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-23 21:00:16,483][Main][INFO] - [train] Step 83900 out of 120000 | Loss --> 2.103 | Grad_l2 --> 0.388 | Weights_l2 --> 52302.057 | Lr --> 0.003 | Seconds_per_step --> 2.099 | [2024-04-23 21:03:49,670][Main][INFO] - [train] Step 84000 out of 120000 | Loss --> 2.117 | Grad_l2 --> 0.396 | Weights_l2 --> 52330.118 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-23 21:07:21,139][Main][INFO] - [train] Step 84100 out of 120000 | Loss --> 2.110 | Grad_l2 --> 0.385 | Weights_l2 --> 52357.990 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-23 21:10:55,410][Main][INFO] - [train] Step 84200 out of 120000 | Loss --> 2.119 | Grad_l2 --> 0.399 | Weights_l2 --> 52385.969 | Lr --> 0.003 | Seconds_per_step --> 2.143 | [2024-04-23 21:14:24,296][Main][INFO] - [train] Step 84300 out of 120000 | Loss --> 2.115 | Grad_l2 --> 0.397 | Weights_l2 --> 52413.910 | Lr --> 0.003 | Seconds_per_step --> 2.089 | [2024-04-23 21:17:56,066][Main][INFO] - [train] Step 84400 out of 120000 | Loss --> 2.123 | Grad_l2 --> 0.390 | Weights_l2 --> 52441.637 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-23 21:21:26,643][Main][INFO] - [train] Step 84500 out of 120000 | Loss --> 2.118 | Grad_l2 --> 0.398 | Weights_l2 --> 52469.678 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-23 21:24:59,782][Main][INFO] - [train] Step 84600 out of 120000 | Loss --> 2.101 | Grad_l2 --> 0.396 | Weights_l2 --> 52497.613 | Lr --> 0.003 | Seconds_per_step --> 2.131 | [2024-04-23 21:28:32,642][Main][INFO] - [train] Step 84700 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.411 | Weights_l2 --> 52525.170 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-23 21:32:03,994][Main][INFO] - [train] Step 84800 out of 120000 | Loss --> 2.089 | 
Grad_l2 --> 0.395 | Weights_l2 --> 52553.342 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-23 21:35:34,004][Main][INFO] - [train] Step 84900 out of 120000 | Loss --> 2.083 | Grad_l2 --> 0.389 | Weights_l2 --> 52580.933 | Lr --> 0.003 | Seconds_per_step --> 2.100 | [2024-04-23 21:39:05,566][Main][INFO] - [train] Step 85000 out of 120000 | Loss --> 2.098 | Grad_l2 --> 0.396 | Weights_l2 --> 52608.317 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-23 21:39:05,828][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-23 21:43:28,189][Main][INFO] - [eval] Step 85000 out of 120000 | Loss --> 1.961 | Accuracy --> 0.649 | Time --> 262.621 | [2024-04-23 21:47:00,576][Main][INFO] - [train] Step 85100 out of 120000 | Loss --> 2.100 | Grad_l2 --> 0.406 | Weights_l2 --> 52635.686 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-23 21:50:33,795][Main][INFO] - [train] Step 85200 out of 120000 | Loss --> 2.106 | Grad_l2 --> 0.403 | Weights_l2 --> 52663.326 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-23 21:54:06,239][Main][INFO] - [train] Step 85300 out of 120000 | Loss --> 2.109 | Grad_l2 --> 0.400 | Weights_l2 --> 52690.954 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-23 21:57:35,144][Main][INFO] - [train] Step 85400 out of 120000 | Loss --> 2.103 | Grad_l2 --> 0.394 | Weights_l2 --> 52718.151 | Lr --> 0.003 | Seconds_per_step --> 2.089 | [2024-04-23 22:01:06,753][Main][INFO] - [train] Step 85500 out of 120000 | Loss --> 2.100 | Grad_l2 --> 0.395 | Weights_l2 --> 52745.502 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-23 22:04:39,865][Main][INFO] - [train] Step 85600 out of 120000 | Loss --> 2.103 | Grad_l2 --> 0.392 | Weights_l2 --> 52772.861 | Lr --> 0.003 | Seconds_per_step --> 2.131 | [2024-04-23 22:08:10,684][Main][INFO] - [train] Step 85700 out of 120000 | Loss --> 2.107 | Grad_l2 --> 0.394 | Weights_l2 --> 52800.474 | Lr --> 0.003 | 
Seconds_per_step --> 2.108 | [2024-04-23 22:11:43,049][Main][INFO] - [train] Step 85800 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.396 | Weights_l2 --> 52828.179 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-23 22:15:15,114][Main][INFO] - [train] Step 85900 out of 120000 | Loss --> 2.098 | Grad_l2 --> 0.391 | Weights_l2 --> 52855.992 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-23 22:18:45,474][Main][INFO] - [train] Step 86000 out of 120000 | Loss --> 2.084 | Grad_l2 --> 0.386 | Weights_l2 --> 52883.110 | Lr --> 0.003 | Seconds_per_step --> 2.104 | [2024-04-23 22:22:16,141][Main][INFO] - [train] Step 86100 out of 120000 | Loss --> 2.091 | Grad_l2 --> 0.398 | Weights_l2 --> 52910.738 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-23 22:25:49,069][Main][INFO] - [train] Step 86200 out of 120000 | Loss --> 2.088 | Grad_l2 --> 0.390 | Weights_l2 --> 52938.134 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-23 22:29:22,567][Main][INFO] - [train] Step 86300 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.393 | Weights_l2 --> 52965.406 | Lr --> 0.003 | Seconds_per_step --> 2.135 | [2024-04-23 22:32:54,586][Main][INFO] - [train] Step 86400 out of 120000 | Loss --> 2.089 | Grad_l2 --> 0.405 | Weights_l2 --> 52993.180 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-23 22:36:24,467][Main][INFO] - [train] Step 86500 out of 120000 | Loss --> 2.084 | Grad_l2 --> 0.392 | Weights_l2 --> 53020.658 | Lr --> 0.003 | Seconds_per_step --> 2.099 | [2024-04-23 22:39:55,068][Main][INFO] - [train] Step 86600 out of 120000 | Loss --> 2.096 | Grad_l2 --> 0.392 | Weights_l2 --> 53047.411 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-23 22:43:26,898][Main][INFO] - [train] Step 86700 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.397 | Weights_l2 --> 53074.691 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-23 22:46:59,139][Main][INFO] - [train] Step 86800 out of 120000 | Loss --> 2.105 | Grad_l2 --> 0.397 | Weights_l2 --> 53102.427 | 
Lr --> 0.003 | Seconds_per_step --> 2.122 | [2024-04-23 22:50:32,386][Main][INFO] - [train] Step 86900 out of 120000 | Loss --> 2.084 | Grad_l2 --> 0.401 | Weights_l2 --> 53129.698 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-23 22:54:08,539][Main][INFO] - [train] Step 87000 out of 120000 | Loss --> 2.099 | Grad_l2 --> 0.393 | Weights_l2 --> 53156.481 | Lr --> 0.003 | Seconds_per_step --> 2.162 | [2024-04-23 22:57:36,169][Main][INFO] - [train] Step 87100 out of 120000 | Loss --> 2.102 | Grad_l2 --> 0.390 | Weights_l2 --> 53183.825 | Lr --> 0.003 | Seconds_per_step --> 2.076 | [2024-04-23 23:01:07,500][Main][INFO] - [train] Step 87200 out of 120000 | Loss --> 2.098 | Grad_l2 --> 0.398 | Weights_l2 --> 53210.852 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-23 23:04:42,587][Main][INFO] - [train] Step 87300 out of 120000 | Loss --> 2.100 | Grad_l2 --> 0.392 | Weights_l2 --> 53237.888 | Lr --> 0.003 | Seconds_per_step --> 2.151 | [2024-04-23 23:08:12,495][Main][INFO] - [train] Step 87400 out of 120000 | Loss --> 2.105 | Grad_l2 --> 0.397 | Weights_l2 --> 53264.880 | Lr --> 0.003 | Seconds_per_step --> 2.099 | [2024-04-23 23:11:44,977][Main][INFO] - [train] Step 87500 out of 120000 | Loss --> 2.109 | Grad_l2 --> 0.396 | Weights_l2 --> 53292.193 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-23 23:15:16,676][Main][INFO] - [train] Step 87600 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.394 | Weights_l2 --> 53319.393 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-23 23:18:51,621][Main][INFO] - [train] Step 87700 out of 120000 | Loss --> 2.097 | Grad_l2 --> 0.395 | Weights_l2 --> 53346.123 | Lr --> 0.003 | Seconds_per_step --> 2.149 | [2024-04-23 23:22:24,794][Main][INFO] - [train] Step 87800 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.394 | Weights_l2 --> 53373.459 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-23 23:25:57,566][Main][INFO] - [train] Step 87900 out of 120000 | Loss --> 2.100 | Grad_l2 --> 0.389 | Weights_l2 
--> 53400.048 | Lr --> 0.003 | Seconds_per_step --> 2.128 | [2024-04-23 23:29:27,381][Main][INFO] - [train] Step 88000 out of 120000 | Loss --> 2.122 | Grad_l2 --> 0.397 | Weights_l2 --> 53427.170 | Lr --> 0.003 | Seconds_per_step --> 2.098 | [2024-04-23 23:32:59,107][Main][INFO] - [train] Step 88100 out of 120000 | Loss --> 2.117 | Grad_l2 --> 0.399 | Weights_l2 --> 53454.100 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-23 23:36:32,346][Main][INFO] - [train] Step 88200 out of 120000 | Loss --> 2.101 | Grad_l2 --> 0.402 | Weights_l2 --> 53481.024 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-23 23:40:05,640][Main][INFO] - [train] Step 88300 out of 120000 | Loss --> 2.100 | Grad_l2 --> 0.397 | Weights_l2 --> 53507.922 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-23 23:43:37,266][Main][INFO] - [train] Step 88400 out of 120000 | Loss --> 2.105 | Grad_l2 --> 0.401 | Weights_l2 --> 53534.634 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-23 23:47:09,266][Main][INFO] - [train] Step 88500 out of 120000 | Loss --> 2.095 | Grad_l2 --> 0.396 | Weights_l2 --> 53561.502 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-23 23:50:37,774][Main][INFO] - [train] Step 88600 out of 120000 | Loss --> 2.109 | Grad_l2 --> 0.407 | Weights_l2 --> 53588.558 | Lr --> 0.003 | Seconds_per_step --> 2.085 | [2024-04-23 23:54:11,100][Main][INFO] - [train] Step 88700 out of 120000 | Loss --> 2.098 | Grad_l2 --> 0.443 | Weights_l2 --> 53615.044 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-23 23:57:41,945][Main][INFO] - [train] Step 88800 out of 120000 | Loss --> 2.107 | Grad_l2 --> 0.392 | Weights_l2 --> 53641.743 | Lr --> 0.003 | Seconds_per_step --> 2.108 | [2024-04-24 00:01:13,697][Main][INFO] - [train] Step 88900 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.393 | Weights_l2 --> 53668.567 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 00:04:45,738][Main][INFO] - [train] Step 89000 out of 120000 | Loss --> 2.114 | Grad_l2 --> 0.390 
| Weights_l2 --> 53695.165 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 00:08:16,677][Main][INFO] - [train] Step 89100 out of 120000 | Loss --> 2.121 | Grad_l2 --> 0.396 | Weights_l2 --> 53721.934 | Lr --> 0.003 | Seconds_per_step --> 2.109 | [2024-04-24 00:11:46,240][Main][INFO] - [train] Step 89200 out of 120000 | Loss --> 2.112 | Grad_l2 --> 0.395 | Weights_l2 --> 53748.310 | Lr --> 0.003 | Seconds_per_step --> 2.096 | [2024-04-24 00:15:19,071][Main][INFO] - [train] Step 89300 out of 120000 | Loss --> 2.104 | Grad_l2 --> 0.386 | Weights_l2 --> 53774.721 | Lr --> 0.003 | Seconds_per_step --> 2.128 | [2024-04-24 00:18:48,866][Main][INFO] - [train] Step 89400 out of 120000 | Loss --> 2.099 | Grad_l2 --> 0.393 | Weights_l2 --> 53801.390 | Lr --> 0.003 | Seconds_per_step --> 2.098 | [2024-04-24 00:22:19,721][Main][INFO] - [train] Step 89500 out of 120000 | Loss --> 2.104 | Grad_l2 --> 0.396 | Weights_l2 --> 53828.021 | Lr --> 0.003 | Seconds_per_step --> 2.109 | [2024-04-24 00:25:52,267][Main][INFO] - [train] Step 89600 out of 120000 | Loss --> 2.104 | Grad_l2 --> 0.402 | Weights_l2 --> 53854.458 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 00:29:23,633][Main][INFO] - [train] Step 89700 out of 120000 | Loss --> 2.102 | Grad_l2 --> 0.389 | Weights_l2 --> 53881.711 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 00:32:56,895][Main][INFO] - [train] Step 89800 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.394 | Weights_l2 --> 53908.671 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-24 00:36:28,838][Main][INFO] - [train] Step 89900 out of 120000 | Loss --> 2.089 | Grad_l2 --> 0.402 | Weights_l2 --> 53935.541 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-24 00:40:00,250][Main][INFO] - [train] Step 90000 out of 120000 | Loss --> 2.091 | Grad_l2 --> 0.401 | Weights_l2 --> 53962.472 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 00:40:00,488][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 00:44:22,599][Main][INFO] - [eval] Step 90000 out of 120000 | Loss --> 1.939 | Accuracy --> 0.651 | Time --> 262.346 | [2024-04-24 00:44:22,603][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-90000 [2024-04-24 00:44:22,606][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-24 00:44:27,275][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-90000/model.safetensors [2024-04-24 00:44:27,327][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-90000/optimizer.bin [2024-04-24 00:44:27,328][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-90000/scheduler.bin [2024-04-24 00:44:27,328][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-90000/sampler.bin [2024-04-24 00:44:27,329][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-90000/sampler_1.bin [2024-04-24 00:44:27,330][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-90000/random_states_0.pkl [2024-04-24 00:47:57,667][Main][INFO] - [train] Step 90100 out of 120000 | Loss --> 2.101 | Grad_l2 --> 0.397 | Weights_l2 --> 53989.348 | Lr --> 0.003 | Seconds_per_step --> 2.151 | [2024-04-24 00:51:29,104][Main][INFO] - [train] Step 90200 out of 120000 | Loss --> 2.102 | Grad_l2 --> 0.395 | Weights_l2 --> 54016.624 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 00:54:59,666][Main][INFO] - [train] Step 90300 out of 120000 | Loss --> 2.078 | Grad_l2 --> 0.402 | Weights_l2 --> 54043.556 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-24 00:58:31,966][Main][INFO] - [train] Step 90400 out of 120000 | Loss --> 2.091 | Grad_l2 --> 0.395 | Weights_l2 --> 54069.966 | Lr --> 0.003 | 
Seconds_per_step --> 2.123 | [2024-04-24 01:02:05,068][Main][INFO] - [train] Step 90500 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.391 | Weights_l2 --> 54096.500 | Lr --> 0.003 | Seconds_per_step --> 2.131 | [2024-04-24 01:05:39,595][Main][INFO] - [train] Step 90600 out of 120000 | Loss --> 2.101 | Grad_l2 --> 0.398 | Weights_l2 --> 54122.838 | Lr --> 0.003 | Seconds_per_step --> 2.145 | [2024-04-24 01:09:10,738][Main][INFO] - [train] Step 90700 out of 120000 | Loss --> 2.108 | Grad_l2 --> 0.410 | Weights_l2 --> 54149.468 | Lr --> 0.003 | Seconds_per_step --> 2.111 | [2024-04-24 01:12:43,127][Main][INFO] - [train] Step 90800 out of 120000 | Loss --> 2.085 | Grad_l2 --> 0.393 | Weights_l2 --> 54175.549 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-24 01:16:14,635][Main][INFO] - [train] Step 90900 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.386 | Weights_l2 --> 54201.640 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 01:19:46,279][Main][INFO] - [train] Step 91000 out of 120000 | Loss --> 2.102 | Grad_l2 --> 0.398 | Weights_l2 --> 54228.459 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-24 01:23:15,338][Main][INFO] - [train] Step 91100 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.397 | Weights_l2 --> 54254.787 | Lr --> 0.003 | Seconds_per_step --> 2.091 | [2024-04-24 01:26:49,767][Main][INFO] - [train] Step 91200 out of 120000 | Loss --> 2.083 | Grad_l2 --> 0.399 | Weights_l2 --> 54281.595 | Lr --> 0.003 | Seconds_per_step --> 2.144 | [2024-04-24 01:30:22,684][Main][INFO] - [train] Step 91300 out of 120000 | Loss --> 2.078 | Grad_l2 --> 0.399 | Weights_l2 --> 54308.134 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-24 01:33:53,939][Main][INFO] - [train] Step 91400 out of 120000 | Loss --> 2.081 | Grad_l2 --> 0.406 | Weights_l2 --> 54334.708 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 01:37:25,066][Main][INFO] - [train] Step 91500 out of 120000 | Loss --> 2.085 | Grad_l2 --> 0.386 | Weights_l2 --> 54360.458 | 
Lr --> 0.003 | Seconds_per_step --> 2.111 | [2024-04-24 01:40:58,106][Main][INFO] - [train] Step 91600 out of 120000 | Loss --> 2.095 | Grad_l2 --> 0.405 | Weights_l2 --> 54387.131 | Lr --> 0.003 | Seconds_per_step --> 2.130 | [2024-04-24 01:44:30,564][Main][INFO] - [train] Step 91700 out of 120000 | Loss --> 2.073 | Grad_l2 --> 0.401 | Weights_l2 --> 54413.450 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 01:47:58,131][Main][INFO] - [train] Step 91800 out of 120000 | Loss --> 2.095 | Grad_l2 --> 0.394 | Weights_l2 --> 54439.788 | Lr --> 0.003 | Seconds_per_step --> 2.076 | [2024-04-24 01:51:30,686][Main][INFO] - [train] Step 91900 out of 120000 | Loss --> 2.078 | Grad_l2 --> 0.395 | Weights_l2 --> 54466.347 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 01:55:06,887][Main][INFO] - [train] Step 92000 out of 120000 | Loss --> 2.073 | Grad_l2 --> 0.403 | Weights_l2 --> 54492.111 | Lr --> 0.003 | Seconds_per_step --> 2.162 | [2024-04-24 01:58:37,795][Main][INFO] - [train] Step 92100 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.398 | Weights_l2 --> 54518.515 | Lr --> 0.003 | Seconds_per_step --> 2.109 | [2024-04-24 02:02:08,045][Main][INFO] - [train] Step 92200 out of 120000 | Loss --> 2.089 | Grad_l2 --> 0.386 | Weights_l2 --> 54544.440 | Lr --> 0.003 | Seconds_per_step --> 2.102 | [2024-04-24 02:05:41,271][Main][INFO] - [train] Step 92300 out of 120000 | Loss --> 2.081 | Grad_l2 --> 0.401 | Weights_l2 --> 54570.522 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-24 02:09:13,794][Main][INFO] - [train] Step 92400 out of 120000 | Loss --> 2.080 | Grad_l2 --> 0.390 | Weights_l2 --> 54596.746 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 02:12:45,686][Main][INFO] - [train] Step 92500 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.402 | Weights_l2 --> 54622.816 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-24 02:16:21,793][Main][INFO] - [train] Step 92600 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.395 | Weights_l2 
--> 54649.228 | Lr --> 0.003 | Seconds_per_step --> 2.161 | [2024-04-24 02:19:50,798][Main][INFO] - [train] Step 92700 out of 120000 | Loss --> 2.093 | Grad_l2 --> 0.396 | Weights_l2 --> 54675.367 | Lr --> 0.003 | Seconds_per_step --> 2.090 | [2024-04-24 02:23:21,002][Main][INFO] - [train] Step 92800 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.396 | Weights_l2 --> 54701.474 | Lr --> 0.003 | Seconds_per_step --> 2.102 | [2024-04-24 02:26:52,182][Main][INFO] - [train] Step 92900 out of 120000 | Loss --> 2.091 | Grad_l2 --> 0.409 | Weights_l2 --> 54727.586 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 02:30:22,840][Main][INFO] - [train] Step 93000 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.397 | Weights_l2 --> 54753.518 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 02:33:54,868][Main][INFO] - [train] Step 93100 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.409 | Weights_l2 --> 54779.446 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 02:37:27,503][Main][INFO] - [train] Step 93200 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.394 | Weights_l2 --> 54805.660 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 02:40:58,574][Main][INFO] - [train] Step 93300 out of 120000 | Loss --> 2.086 | Grad_l2 --> 0.394 | Weights_l2 --> 54831.763 | Lr --> 0.003 | Seconds_per_step --> 2.111 | [2024-04-24 02:44:30,116][Main][INFO] - [train] Step 93400 out of 120000 | Loss --> 2.091 | Grad_l2 --> 1.280 | Weights_l2 --> 54857.849 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 02:48:01,319][Main][INFO] - [train] Step 93500 out of 120000 | Loss --> 2.090 | Grad_l2 --> 0.431 | Weights_l2 --> 54884.042 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 02:51:33,367][Main][INFO] - [train] Step 93600 out of 120000 | Loss --> 2.086 | Grad_l2 --> 0.399 | Weights_l2 --> 54909.819 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 02:55:07,038][Main][INFO] - [train] Step 93700 out of 120000 | Loss --> 2.083 | Grad_l2 --> 0.392 
| Weights_l2 --> 54935.896 | Lr --> 0.003 | Seconds_per_step --> 2.137 | [2024-04-24 02:58:37,489][Main][INFO] - [train] Step 93800 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.394 | Weights_l2 --> 54962.119 | Lr --> 0.003 | Seconds_per_step --> 2.105 | [2024-04-24 03:02:11,267][Main][INFO] - [train] Step 93900 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.402 | Weights_l2 --> 54987.883 | Lr --> 0.003 | Seconds_per_step --> 2.138 | [2024-04-24 03:05:43,143][Main][INFO] - [train] Step 94000 out of 120000 | Loss --> 2.078 | Grad_l2 --> 0.397 | Weights_l2 --> 55014.079 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-24 03:09:11,993][Main][INFO] - [train] Step 94100 out of 120000 | Loss --> 2.076 | Grad_l2 --> 0.397 | Weights_l2 --> 55040.379 | Lr --> 0.003 | Seconds_per_step --> 2.088 | [2024-04-24 03:12:45,442][Main][INFO] - [train] Step 94200 out of 120000 | Loss --> 2.080 | Grad_l2 --> 0.387 | Weights_l2 --> 55066.442 | Lr --> 0.003 | Seconds_per_step --> 2.134 | [2024-04-24 03:16:15,875][Main][INFO] - [train] Step 94300 out of 120000 | Loss --> 2.087 | Grad_l2 --> 0.399 | Weights_l2 --> 55092.590 | Lr --> 0.003 | Seconds_per_step --> 2.104 | [2024-04-24 03:19:47,337][Main][INFO] - [train] Step 94400 out of 120000 | Loss --> 2.084 | Grad_l2 --> 0.397 | Weights_l2 --> 55118.346 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 03:23:18,666][Main][INFO] - [train] Step 94500 out of 120000 | Loss --> 2.096 | Grad_l2 --> 0.400 | Weights_l2 --> 55144.198 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 03:26:53,195][Main][INFO] - [train] Step 94600 out of 120000 | Loss --> 2.090 | Grad_l2 --> 0.391 | Weights_l2 --> 55170.343 | Lr --> 0.003 | Seconds_per_step --> 2.145 | [2024-04-24 03:30:26,467][Main][INFO] - [train] Step 94700 out of 120000 | Loss --> 2.074 | Grad_l2 --> 0.409 | Weights_l2 --> 55195.977 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-24 03:33:58,090][Main][INFO] - [train] Step 94800 out of 120000 | Loss --> 2.059 | 
Grad_l2 --> 0.397 | Weights_l2 --> 55221.671 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-24 03:37:32,496][Main][INFO] - [train] Step 94900 out of 120000 | Loss --> 2.061 | Grad_l2 --> 0.409 | Weights_l2 --> 55247.808 | Lr --> 0.003 | Seconds_per_step --> 2.144 | [2024-04-24 03:41:05,001][Main][INFO] - [train] Step 95000 out of 120000 | Loss --> 2.074 | Grad_l2 --> 0.403 | Weights_l2 --> 55273.233 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 03:41:05,232][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 03:45:29,036][Main][INFO] - [eval] Step 95000 out of 120000 | Loss --> 1.929 | Accuracy --> 0.653 | Time --> 264.033 | [2024-04-24 03:49:00,781][Main][INFO] - [train] Step 95100 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.396 | Weights_l2 --> 55299.156 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-24 03:52:31,439][Main][INFO] - [train] Step 95200 out of 120000 | Loss --> 2.069 | Grad_l2 --> 0.395 | Weights_l2 --> 55324.610 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 03:56:02,141][Main][INFO] - [train] Step 95300 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.398 | Weights_l2 --> 55350.381 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 03:59:34,750][Main][INFO] - [train] Step 95400 out of 120000 | Loss --> 2.090 | Grad_l2 --> 0.394 | Weights_l2 --> 55376.294 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 04:03:06,039][Main][INFO] - [train] Step 95500 out of 120000 | Loss --> 2.069 | Grad_l2 --> 0.394 | Weights_l2 --> 55402.028 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 04:06:37,503][Main][INFO] - [train] Step 95600 out of 120000 | Loss --> 2.074 | Grad_l2 --> 0.396 | Weights_l2 --> 55427.583 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 04:10:09,197][Main][INFO] - [train] Step 95700 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.399 | Weights_l2 --> 55453.584 | Lr --> 0.003 | 
Seconds_per_step --> 2.117 | [2024-04-24 04:13:43,079][Main][INFO] - [train] Step 95800 out of 120000 | Loss --> 2.083 | Grad_l2 --> 0.396 | Weights_l2 --> 55479.252 | Lr --> 0.003 | Seconds_per_step --> 2.139 | [2024-04-24 04:17:16,469][Main][INFO] - [train] Step 95900 out of 120000 | Loss --> 2.072 | Grad_l2 --> 0.404 | Weights_l2 --> 55505.194 | Lr --> 0.003 | Seconds_per_step --> 2.134 | [2024-04-24 04:20:50,939][Main][INFO] - [train] Step 96000 out of 120000 | Loss --> 2.092 | Grad_l2 --> 0.401 | Weights_l2 --> 55531.059 | Lr --> 0.003 | Seconds_per_step --> 2.145 | [2024-04-24 04:24:21,868][Main][INFO] - [train] Step 96100 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.398 | Weights_l2 --> 55556.879 | Lr --> 0.003 | Seconds_per_step --> 2.109 | [2024-04-24 04:27:50,189][Main][INFO] - [train] Step 96200 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.404 | Weights_l2 --> 55582.574 | Lr --> 0.003 | Seconds_per_step --> 2.083 | [2024-04-24 04:31:25,615][Main][INFO] - [train] Step 96300 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.401 | Weights_l2 --> 55608.401 | Lr --> 0.003 | Seconds_per_step --> 2.154 | [2024-04-24 04:34:53,067][Main][INFO] - [train] Step 96400 out of 120000 | Loss --> 2.077 | Grad_l2 --> 0.404 | Weights_l2 --> 55634.418 | Lr --> 0.003 | Seconds_per_step --> 2.075 | [2024-04-24 04:38:27,348][Main][INFO] - [train] Step 96500 out of 120000 | Loss --> 2.083 | Grad_l2 --> 0.401 | Weights_l2 --> 55660.084 | Lr --> 0.003 | Seconds_per_step --> 2.143 | [2024-04-24 04:41:59,467][Main][INFO] - [train] Step 96600 out of 120000 | Loss --> 2.077 | Grad_l2 --> 0.400 | Weights_l2 --> 55685.835 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-24 04:45:31,266][Main][INFO] - [train] Step 96700 out of 120000 | Loss --> 2.085 | Grad_l2 --> 0.397 | Weights_l2 --> 55711.808 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 04:49:04,468][Main][INFO] - [train] Step 96800 out of 120000 | Loss --> 2.068 | Grad_l2 --> 0.406 | Weights_l2 --> 55737.536 | 
Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-24 04:52:32,766][Main][INFO] - [train] Step 96900 out of 120000 | Loss --> 2.071 | Grad_l2 --> 0.397 | Weights_l2 --> 55763.243 | Lr --> 0.003 | Seconds_per_step --> 2.083 | [2024-04-24 04:56:07,099][Main][INFO] - [train] Step 97000 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.406 | Weights_l2 --> 55789.224 | Lr --> 0.003 | Seconds_per_step --> 2.143 | [2024-04-24 04:59:37,172][Main][INFO] - [train] Step 97100 out of 120000 | Loss --> 2.094 | Grad_l2 --> 0.408 | Weights_l2 --> 55815.404 | Lr --> 0.003 | Seconds_per_step --> 2.101 | [2024-04-24 05:03:11,169][Main][INFO] - [train] Step 97200 out of 120000 | Loss --> 2.091 | Grad_l2 --> 0.405 | Weights_l2 --> 55841.624 | Lr --> 0.003 | Seconds_per_step --> 2.140 | [2024-04-24 05:06:40,464][Main][INFO] - [train] Step 97300 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.399 | Weights_l2 --> 55867.308 | Lr --> 0.003 | Seconds_per_step --> 2.093 | [2024-04-24 05:10:10,101][Main][INFO] - [train] Step 97400 out of 120000 | Loss --> 2.068 | Grad_l2 --> 0.398 | Weights_l2 --> 55892.814 | Lr --> 0.003 | Seconds_per_step --> 2.096 | [2024-04-24 05:13:43,894][Main][INFO] - [train] Step 97500 out of 120000 | Loss --> 2.078 | Grad_l2 --> 0.409 | Weights_l2 --> 55918.191 | Lr --> 0.003 | Seconds_per_step --> 2.138 | [2024-04-24 05:17:19,538][Main][INFO] - [train] Step 97600 out of 120000 | Loss --> 2.076 | Grad_l2 --> 0.403 | Weights_l2 --> 55943.175 | Lr --> 0.003 | Seconds_per_step --> 2.156 | [2024-04-24 05:20:47,981][Main][INFO] - [train] Step 97700 out of 120000 | Loss --> 2.062 | Grad_l2 --> 0.403 | Weights_l2 --> 55968.465 | Lr --> 0.003 | Seconds_per_step --> 2.084 | [2024-04-24 05:24:20,094][Main][INFO] - [train] Step 97800 out of 120000 | Loss --> 2.074 | Grad_l2 --> 0.404 | Weights_l2 --> 55993.928 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-24 05:27:52,576][Main][INFO] - [train] Step 97900 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.400 | Weights_l2 
--> 56019.311 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 05:31:23,410][Main][INFO] - [train] Step 98000 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.390 | Weights_l2 --> 56044.828 | Lr --> 0.003 | Seconds_per_step --> 2.108 | [2024-04-24 05:34:53,839][Main][INFO] - [train] Step 98100 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.408 | Weights_l2 --> 56071.026 | Lr --> 0.003 | Seconds_per_step --> 2.104 | [2024-04-24 05:38:24,938][Main][INFO] - [train] Step 98200 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.403 | Weights_l2 --> 56096.532 | Lr --> 0.003 | Seconds_per_step --> 2.111 | [2024-04-24 05:41:55,439][Main][INFO] - [train] Step 98300 out of 120000 | Loss --> 2.059 | Grad_l2 --> 0.391 | Weights_l2 --> 56121.920 | Lr --> 0.003 | Seconds_per_step --> 2.105 | [2024-04-24 05:45:28,994][Main][INFO] - [train] Step 98400 out of 120000 | Loss --> 2.082 | Grad_l2 --> 0.403 | Weights_l2 --> 56147.574 | Lr --> 0.003 | Seconds_per_step --> 2.136 | [2024-04-24 05:49:01,296][Main][INFO] - [train] Step 98500 out of 120000 | Loss --> 2.077 | Grad_l2 --> 0.407 | Weights_l2 --> 56172.737 | Lr --> 0.003 | Seconds_per_step --> 2.123 | [2024-04-24 05:52:35,794][Main][INFO] - [train] Step 98600 out of 120000 | Loss --> 2.079 | Grad_l2 --> 0.403 | Weights_l2 --> 56198.308 | Lr --> 0.003 | Seconds_per_step --> 2.145 | [2024-04-24 05:56:06,073][Main][INFO] - [train] Step 98700 out of 120000 | Loss --> 2.061 | Grad_l2 --> 0.399 | Weights_l2 --> 56223.820 | Lr --> 0.003 | Seconds_per_step --> 2.103 | [2024-04-24 05:59:36,209][Main][INFO] - [train] Step 98800 out of 120000 | Loss --> 2.055 | Grad_l2 --> 0.397 | Weights_l2 --> 56249.430 | Lr --> 0.003 | Seconds_per_step --> 2.101 | [2024-04-24 06:03:08,440][Main][INFO] - [train] Step 98900 out of 120000 | Loss --> 2.064 | Grad_l2 --> 0.399 | Weights_l2 --> 56274.921 | Lr --> 0.003 | Seconds_per_step --> 2.122 | [2024-04-24 06:06:41,879][Main][INFO] - [train] Step 99000 out of 120000 | Loss --> 2.056 | Grad_l2 --> 0.400 
| Weights_l2 --> 56300.340 | Lr --> 0.003 | Seconds_per_step --> 2.134 | [2024-04-24 06:10:11,738][Main][INFO] - [train] Step 99100 out of 120000 | Loss --> 2.059 | Grad_l2 --> 0.397 | Weights_l2 --> 56325.273 | Lr --> 0.003 | Seconds_per_step --> 2.099 | [2024-04-24 06:13:47,794][Main][INFO] - [train] Step 99200 out of 120000 | Loss --> 2.069 | Grad_l2 --> 0.399 | Weights_l2 --> 56350.893 | Lr --> 0.003 | Seconds_per_step --> 2.161 | [2024-04-24 06:17:15,787][Main][INFO] - [train] Step 99300 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.412 | Weights_l2 --> 56375.784 | Lr --> 0.003 | Seconds_per_step --> 2.080 | [2024-04-24 06:20:46,011][Main][INFO] - [train] Step 99400 out of 120000 | Loss --> 2.056 | Grad_l2 --> 0.397 | Weights_l2 --> 56401.078 | Lr --> 0.003 | Seconds_per_step --> 2.102 | [2024-04-24 06:24:17,237][Main][INFO] - [train] Step 99500 out of 120000 | Loss --> 2.056 | Grad_l2 --> 0.396 | Weights_l2 --> 56426.516 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 06:27:50,944][Main][INFO] - [train] Step 99600 out of 120000 | Loss --> 2.058 | Grad_l2 --> 0.390 | Weights_l2 --> 56451.694 | Lr --> 0.003 | Seconds_per_step --> 2.137 | [2024-04-24 06:31:23,495][Main][INFO] - [train] Step 99700 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.410 | Weights_l2 --> 56476.836 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 06:34:56,578][Main][INFO] - [train] Step 99800 out of 120000 | Loss --> 2.067 | Grad_l2 --> 0.406 | Weights_l2 --> 56502.362 | Lr --> 0.003 | Seconds_per_step --> 2.131 | [2024-04-24 06:38:27,238][Main][INFO] - [train] Step 99900 out of 120000 | Loss --> 2.074 | Grad_l2 --> 0.413 | Weights_l2 --> 56527.723 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 06:41:59,838][Main][INFO] - [train] Step 100000 out of 120000 | Loss --> 2.075 | Grad_l2 --> 0.391 | Weights_l2 --> 56552.599 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 06:42:00,056][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 
(max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 06:46:28,857][Main][INFO] - [eval] Step 100000 out of 120000 | Loss --> 1.911 | Accuracy --> 0.656 | Time --> 269.016 | [2024-04-24 06:46:28,861][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-100000 [2024-04-24 06:46:28,864][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-24 06:46:33,798][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-100000/model.safetensors [2024-04-24 06:46:33,850][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-100000/optimizer.bin [2024-04-24 06:46:33,852][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-100000/scheduler.bin [2024-04-24 06:46:33,852][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-100000/sampler.bin [2024-04-24 06:46:33,852][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-100000/sampler_1.bin [2024-04-24 06:46:33,853][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-100000/random_states_0.pkl [2024-04-24 06:50:03,896][Main][INFO] - [train] Step 100100 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.399 | Weights_l2 --> 56578.070 | Lr --> 0.003 | Seconds_per_step --> 2.150 | [2024-04-24 06:53:36,638][Main][INFO] - [train] Step 100200 out of 120000 | Loss --> 2.070 | Grad_l2 --> 0.401 | Weights_l2 --> 56603.217 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 06:57:08,467][Main][INFO] - [train] Step 100300 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.394 | Weights_l2 --> 56628.148 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 07:00:40,466][Main][INFO] - [train] Step 100400 out of 120000 | Loss --> 2.065 | Grad_l2 --> 0.398 | Weights_l2 --> 56653.686 | Lr 
--> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 07:04:10,192][Main][INFO] - [train] Step 100500 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.394 | Weights_l2 --> 56678.936 | Lr --> 0.003 | Seconds_per_step --> 2.097 | [2024-04-24 07:07:41,572][Main][INFO] - [train] Step 100600 out of 120000 | Loss --> 2.057 | Grad_l2 --> 0.402 | Weights_l2 --> 56704.358 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 07:11:11,353][Main][INFO] - [train] Step 100700 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.403 | Weights_l2 --> 56729.595 | Lr --> 0.003 | Seconds_per_step --> 2.098 | [2024-04-24 07:14:43,748][Main][INFO] - [train] Step 100800 out of 120000 | Loss --> 2.069 | Grad_l2 --> 0.403 | Weights_l2 --> 56754.432 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-24 07:18:16,495][Main][INFO] - [train] Step 100900 out of 120000 | Loss --> 2.069 | Grad_l2 --> 0.401 | Weights_l2 --> 56779.860 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 07:21:48,713][Main][INFO] - [train] Step 101000 out of 120000 | Loss --> 2.073 | Grad_l2 --> 0.404 | Weights_l2 --> 56805.325 | Lr --> 0.003 | Seconds_per_step --> 2.122 | [2024-04-24 07:25:20,539][Main][INFO] - [train] Step 101100 out of 120000 | Loss --> 2.051 | Grad_l2 --> 0.403 | Weights_l2 --> 56830.432 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 07:28:53,742][Main][INFO] - [train] Step 101200 out of 120000 | Loss --> 2.047 | Grad_l2 --> 0.393 | Weights_l2 --> 56855.495 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-24 07:32:26,895][Main][INFO] - [train] Step 101300 out of 120000 | Loss --> 2.064 | Grad_l2 --> 0.403 | Weights_l2 --> 56880.392 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-24 07:35:58,075][Main][INFO] - [train] Step 101400 out of 120000 | Loss --> 2.049 | Grad_l2 --> 0.399 | Weights_l2 --> 56905.936 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 07:39:30,973][Main][INFO] - [train] Step 101500 out of 120000 | Loss --> 2.067 | Grad_l2 --> 0.400 | 
Weights_l2 --> 56931.199 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-24 07:43:02,440][Main][INFO] - [train] Step 101600 out of 120000 | Loss --> 2.061 | Grad_l2 --> 0.402 | Weights_l2 --> 56956.060 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 07:46:34,504][Main][INFO] - [train] Step 101700 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.395 | Weights_l2 --> 56981.226 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-24 07:50:06,639][Main][INFO] - [train] Step 101800 out of 120000 | Loss --> 2.053 | Grad_l2 --> 0.393 | Weights_l2 --> 57006.305 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-24 07:53:37,867][Main][INFO] - [train] Step 101900 out of 120000 | Loss --> 2.061 | Grad_l2 --> 0.410 | Weights_l2 --> 57031.232 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 07:57:08,822][Main][INFO] - [train] Step 102000 out of 120000 | Loss --> 2.059 | Grad_l2 --> 0.404 | Weights_l2 --> 57056.552 | Lr --> 0.003 | Seconds_per_step --> 2.110 | [2024-04-24 08:00:43,191][Main][INFO] - [train] Step 102100 out of 120000 | Loss --> 2.071 | Grad_l2 --> 0.405 | Weights_l2 --> 57081.856 | Lr --> 0.003 | Seconds_per_step --> 2.144 | [2024-04-24 08:04:13,071][Main][INFO] - [train] Step 102200 out of 120000 | Loss --> 2.066 | Grad_l2 --> 0.412 | Weights_l2 --> 57107.266 | Lr --> 0.003 | Seconds_per_step --> 2.099 | [2024-04-24 08:07:46,608][Main][INFO] - [train] Step 102300 out of 120000 | Loss --> 2.055 | Grad_l2 --> 0.415 | Weights_l2 --> 57132.448 | Lr --> 0.003 | Seconds_per_step --> 2.135 | [2024-04-24 08:11:18,038][Main][INFO] - [train] Step 102400 out of 120000 | Loss --> 2.046 | Grad_l2 --> 0.399 | Weights_l2 --> 57157.446 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 08:14:50,905][Main][INFO] - [train] Step 102500 out of 120000 | Loss --> 2.071 | Grad_l2 --> 0.404 | Weights_l2 --> 57182.473 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-24 08:18:19,740][Main][INFO] - [train] Step 102600 out of 120000 | Loss --> 
2.041 | Grad_l2 --> 0.397 | Weights_l2 --> 57207.589 | Lr --> 0.003 | Seconds_per_step --> 2.088 | [2024-04-24 08:21:52,367][Main][INFO] - [train] Step 102700 out of 120000 | Loss --> 2.060 | Grad_l2 --> 0.399 | Weights_l2 --> 57232.537 | Lr --> 0.003 | Seconds_per_step --> 2.126 | [2024-04-24 08:25:24,366][Main][INFO] - [train] Step 102800 out of 120000 | Loss --> 2.048 | Grad_l2 --> 0.405 | Weights_l2 --> 57257.547 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 08:28:57,967][Main][INFO] - [train] Step 102900 out of 120000 | Loss --> 2.061 | Grad_l2 --> 0.402 | Weights_l2 --> 57282.648 | Lr --> 0.003 | Seconds_per_step --> 2.136 | [2024-04-24 08:32:29,367][Main][INFO] - [train] Step 103000 out of 120000 | Loss --> 2.048 | Grad_l2 --> 0.404 | Weights_l2 --> 57307.324 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 08:36:03,094][Main][INFO] - [train] Step 103100 out of 120000 | Loss --> 2.051 | Grad_l2 --> 0.392 | Weights_l2 --> 57332.034 | Lr --> 0.003 | Seconds_per_step --> 2.137 | [2024-04-24 08:39:35,007][Main][INFO] - [train] Step 103200 out of 120000 | Loss --> 2.031 | Grad_l2 --> 0.396 | Weights_l2 --> 57356.805 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-24 08:43:08,096][Main][INFO] - [train] Step 103300 out of 120000 | Loss --> 2.035 | Grad_l2 --> 0.405 | Weights_l2 --> 57381.531 | Lr --> 0.003 | Seconds_per_step --> 2.131 | [2024-04-24 08:46:37,196][Main][INFO] - [train] Step 103400 out of 120000 | Loss --> 2.031 | Grad_l2 --> 0.396 | Weights_l2 --> 57405.900 | Lr --> 0.003 | Seconds_per_step --> 2.091 | [2024-04-24 08:50:09,894][Main][INFO] - [train] Step 103500 out of 120000 | Loss --> 2.049 | Grad_l2 --> 0.399 | Weights_l2 --> 57430.915 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 08:53:42,238][Main][INFO] - [train] Step 103600 out of 120000 | Loss --> 2.043 | Grad_l2 --> 0.391 | Weights_l2 --> 57455.750 | Lr --> 0.003 | Seconds_per_step --> 2.123 | [2024-04-24 08:57:12,641][Main][INFO] - [train] Step 103700 
out of 120000 | Loss --> 2.057 | Grad_l2 --> 0.407 | Weights_l2 --> 57480.855 | Lr --> 0.003 | Seconds_per_step --> 2.104 | [2024-04-24 09:00:44,470][Main][INFO] - [train] Step 103800 out of 120000 | Loss --> 2.045 | Grad_l2 --> 0.410 | Weights_l2 --> 57505.651 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 09:04:15,969][Main][INFO] - [train] Step 103900 out of 120000 | Loss --> 2.048 | Grad_l2 --> 0.398 | Weights_l2 --> 57530.610 | Lr --> 0.003 | Seconds_per_step --> 2.115 | [2024-04-24 09:07:47,294][Main][INFO] - [train] Step 104000 out of 120000 | Loss --> 2.056 | Grad_l2 --> 0.406 | Weights_l2 --> 57555.391 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 09:11:19,793][Main][INFO] - [train] Step 104100 out of 120000 | Loss --> 2.033 | Grad_l2 --> 0.410 | Weights_l2 --> 57580.646 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 09:14:54,078][Main][INFO] - [train] Step 104200 out of 120000 | Loss --> 2.056 | Grad_l2 --> 0.405 | Weights_l2 --> 57605.345 | Lr --> 0.003 | Seconds_per_step --> 2.143 | [2024-04-24 09:18:25,737][Main][INFO] - [train] Step 104300 out of 120000 | Loss --> 2.041 | Grad_l2 --> 0.401 | Weights_l2 --> 57630.189 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-24 09:21:57,584][Main][INFO] - [train] Step 104400 out of 120000 | Loss --> 2.048 | Grad_l2 --> 0.404 | Weights_l2 --> 57654.904 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 09:25:29,543][Main][INFO] - [train] Step 104500 out of 120000 | Loss --> 2.042 | Grad_l2 --> 0.402 | Weights_l2 --> 57679.268 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 09:29:00,438][Main][INFO] - [train] Step 104600 out of 120000 | Loss --> 2.026 | Grad_l2 --> 0.399 | Weights_l2 --> 57703.862 | Lr --> 0.003 | Seconds_per_step --> 2.109 | [2024-04-24 09:32:31,071][Main][INFO] - [train] Step 104700 out of 120000 | Loss --> 2.035 | Grad_l2 --> 0.409 | Weights_l2 --> 57728.824 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-24 
09:36:03,475][Main][INFO] - [train] Step 104800 out of 120000 | Loss --> 2.033 | Grad_l2 --> 0.400 | Weights_l2 --> 57753.866 | Lr --> 0.003 | Seconds_per_step --> 2.124 | [2024-04-24 09:39:35,308][Main][INFO] - [train] Step 104900 out of 120000 | Loss --> 2.037 | Grad_l2 --> 0.396 | Weights_l2 --> 57777.998 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 09:43:06,003][Main][INFO] - [train] Step 105000 out of 120000 | Loss --> 2.036 | Grad_l2 --> 0.401 | Weights_l2 --> 57802.703 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 09:43:06,213][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 09:47:31,721][Main][INFO] - [eval] Step 105000 out of 120000 | Loss --> 1.902 | Accuracy --> 0.658 | Time --> 265.715 | [2024-04-24 09:51:06,349][Main][INFO] - [train] Step 105100 out of 120000 | Loss --> 2.034 | Grad_l2 --> 0.396 | Weights_l2 --> 57827.491 | Lr --> 0.003 | Seconds_per_step --> 2.146 | [2024-04-24 09:54:37,767][Main][INFO] - [train] Step 105200 out of 120000 | Loss --> 2.051 | Grad_l2 --> 0.395 | Weights_l2 --> 57852.329 | Lr --> 0.003 | Seconds_per_step --> 2.114 | [2024-04-24 09:58:10,500][Main][INFO] - [train] Step 105300 out of 120000 | Loss --> 2.039 | Grad_l2 --> 0.407 | Weights_l2 --> 57876.862 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 10:01:40,667][Main][INFO] - [train] Step 105400 out of 120000 | Loss --> 2.031 | Grad_l2 --> 0.397 | Weights_l2 --> 57901.287 | Lr --> 0.003 | Seconds_per_step --> 2.102 | [2024-04-24 10:05:12,465][Main][INFO] - [train] Step 105500 out of 120000 | Loss --> 2.031 | Grad_l2 --> 0.397 | Weights_l2 --> 57925.976 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 10:08:45,660][Main][INFO] - [train] Step 105600 out of 120000 | Loss --> 2.022 | Grad_l2 --> 0.393 | Weights_l2 --> 57951.078 | Lr --> 0.003 | Seconds_per_step --> 2.132 | [2024-04-24 10:12:16,839][Main][INFO] - [train] Step 105700 out of 
120000 | Loss --> 2.026 | Grad_l2 --> 0.409 | Weights_l2 --> 57975.415 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 10:15:49,338][Main][INFO] - [train] Step 105800 out of 120000 | Loss --> 2.041 | Grad_l2 --> 0.404 | Weights_l2 --> 57999.590 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 10:19:19,965][Main][INFO] - [train] Step 105900 out of 120000 | Loss --> 2.033 | Grad_l2 --> 0.395 | Weights_l2 --> 58023.812 | Lr --> 0.003 | Seconds_per_step --> 2.106 | [2024-04-24 10:22:52,038][Main][INFO] - [train] Step 106000 out of 120000 | Loss --> 2.028 | Grad_l2 --> 0.408 | Weights_l2 --> 58048.089 | Lr --> 0.003 | Seconds_per_step --> 2.121 | [2024-04-24 10:26:23,666][Main][INFO] - [train] Step 106100 out of 120000 | Loss --> 2.036 | Grad_l2 --> 0.413 | Weights_l2 --> 58072.590 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-24 10:29:57,287][Main][INFO] - [train] Step 106200 out of 120000 | Loss --> 2.044 | Grad_l2 --> 0.411 | Weights_l2 --> 58096.614 | Lr --> 0.003 | Seconds_per_step --> 2.136 | [2024-04-24 10:33:26,813][Main][INFO] - [train] Step 106300 out of 120000 | Loss --> 2.028 | Grad_l2 --> 0.399 | Weights_l2 --> 58121.150 | Lr --> 0.003 | Seconds_per_step --> 2.095 | [2024-04-24 10:36:58,539][Main][INFO] - [train] Step 106400 out of 120000 | Loss --> 2.041 | Grad_l2 --> 0.403 | Weights_l2 --> 58145.236 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-24 10:40:29,571][Main][INFO] - [train] Step 106500 out of 120000 | Loss --> 2.059 | Grad_l2 --> 0.408 | Weights_l2 --> 58169.918 | Lr --> 0.003 | Seconds_per_step --> 2.110 | [2024-04-24 10:44:04,239][Main][INFO] - [train] Step 106600 out of 120000 | Loss --> 2.048 | Grad_l2 --> 0.398 | Weights_l2 --> 58194.348 | Lr --> 0.003 | Seconds_per_step --> 2.147 | [2024-04-24 10:47:36,938][Main][INFO] - [train] Step 106700 out of 120000 | Loss --> 2.047 | Grad_l2 --> 0.409 | Weights_l2 --> 58218.803 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 10:51:06,938][Main][INFO] - 
[train] Step 106800 out of 120000 | Loss --> 2.031 | Grad_l2 --> 0.403 | Weights_l2 --> 58243.433 | Lr --> 0.003 | Seconds_per_step --> 2.100 | [2024-04-24 10:54:38,269][Main][INFO] - [train] Step 106900 out of 120000 | Loss --> 2.040 | Grad_l2 --> 0.432 | Weights_l2 --> 58268.061 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 10:58:09,842][Main][INFO] - [train] Step 107000 out of 120000 | Loss --> 2.040 | Grad_l2 --> 0.401 | Weights_l2 --> 58292.263 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-24 11:01:42,338][Main][INFO] - [train] Step 107100 out of 120000 | Loss --> 2.058 | Grad_l2 --> 0.394 | Weights_l2 --> 58316.856 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 11:05:14,566][Main][INFO] - [train] Step 107200 out of 120000 | Loss --> 2.034 | Grad_l2 --> 0.401 | Weights_l2 --> 58341.163 | Lr --> 0.003 | Seconds_per_step --> 2.122 | [2024-04-24 11:08:43,889][Main][INFO] - [train] Step 107300 out of 120000 | Loss --> 2.055 | Grad_l2 --> 0.412 | Weights_l2 --> 58365.073 | Lr --> 0.003 | Seconds_per_step --> 2.093 | [2024-04-24 11:12:17,141][Main][INFO] - [train] Step 107400 out of 120000 | Loss --> 2.042 | Grad_l2 --> 0.401 | Weights_l2 --> 58389.136 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-24 11:15:49,914][Main][INFO] - [train] Step 107500 out of 120000 | Loss --> 2.038 | Grad_l2 --> 0.394 | Weights_l2 --> 58412.950 | Lr --> 0.003 | Seconds_per_step --> 2.128 | [2024-04-24 11:19:21,474][Main][INFO] - [train] Step 107600 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.409 | Weights_l2 --> 58437.504 | Lr --> 0.003 | Seconds_per_step --> 2.116 | [2024-04-24 11:22:54,378][Main][INFO] - [train] Step 107700 out of 120000 | Loss --> 2.030 | Grad_l2 --> 0.400 | Weights_l2 --> 58461.771 | Lr --> 0.003 | Seconds_per_step --> 2.129 | [2024-04-24 11:26:25,568][Main][INFO] - [train] Step 107800 out of 120000 | Loss --> 2.041 | Grad_l2 --> 0.406 | Weights_l2 --> 58486.327 | Lr --> 0.003 | Seconds_per_step --> 2.112 | [2024-04-24 
11:29:58,072][Main][INFO] - [train] Step 107900 out of 120000 | Loss --> 2.042 | Grad_l2 --> 0.405 | Weights_l2 --> 58510.364 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 11:33:31,366][Main][INFO] - [train] Step 108000 out of 120000 | Loss --> 2.018 | Grad_l2 --> 0.418 | Weights_l2 --> 58534.272 | Lr --> 0.003 | Seconds_per_step --> 2.133 | [2024-04-24 11:37:05,241][Main][INFO] - [train] Step 108100 out of 120000 | Loss --> 2.024 | Grad_l2 --> 0.406 | Weights_l2 --> 58558.531 | Lr --> 0.003 | Seconds_per_step --> 2.139 | [2024-04-24 11:40:37,996][Main][INFO] - [train] Step 108200 out of 120000 | Loss --> 2.034 | Grad_l2 --> 0.408 | Weights_l2 --> 58581.979 | Lr --> 0.003 | Seconds_per_step --> 2.128 | [2024-04-24 11:44:08,151][Main][INFO] - [train] Step 108300 out of 120000 | Loss --> 2.052 | Grad_l2 --> 0.413 | Weights_l2 --> 58605.329 | Lr --> 0.003 | Seconds_per_step --> 2.102 | [2024-04-24 11:47:40,188][Main][INFO] - [train] Step 108400 out of 120000 | Loss --> 2.038 | Grad_l2 --> 0.413 | Weights_l2 --> 58628.055 | Lr --> 0.003 | Seconds_per_step --> 2.120 | [2024-04-24 11:51:12,892][Main][INFO] - [train] Step 108500 out of 120000 | Loss --> 2.057 | Grad_l2 --> 0.407 | Weights_l2 --> 58650.442 | Lr --> 0.003 | Seconds_per_step --> 2.127 | [2024-04-24 11:54:44,584][Main][INFO] - [train] Step 108600 out of 120000 | Loss --> 2.058 | Grad_l2 --> 0.409 | Weights_l2 --> 58672.014 | Lr --> 0.003 | Seconds_per_step --> 2.117 | [2024-04-24 11:58:17,086][Main][INFO] - [train] Step 108700 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.402 | Weights_l2 --> 58693.736 | Lr --> 0.003 | Seconds_per_step --> 2.125 | [2024-04-24 12:01:47,542][Main][INFO] - [train] Step 108800 out of 120000 | Loss --> 2.050 | Grad_l2 --> 0.395 | Weights_l2 --> 58714.562 | Lr --> 0.003 | Seconds_per_step --> 2.105 | [2024-04-24 12:05:19,902][Main][INFO] - [train] Step 108900 out of 120000 | Loss --> 2.054 | Grad_l2 --> 0.406 | Weights_l2 --> 58735.450 | Lr --> 0.003 | Seconds_per_step 
--> 2.124 | [2024-04-24 12:08:50,568][Main][INFO] - [train] Step 109000 out of 120000 | Loss --> 2.063 | Grad_l2 --> 0.396 | Weights_l2 --> 58756.084 | Lr --> 0.003 | Seconds_per_step --> 2.107 | [2024-04-24 12:12:24,074][Main][INFO] - [train] Step 109100 out of 120000 | Loss --> 2.050 | Grad_l2 --> 0.396 | Weights_l2 --> 58775.954 | Lr --> 0.003 | Seconds_per_step --> 2.135 | [2024-04-24 12:15:54,395][Main][INFO] - [train] Step 109200 out of 120000 | Loss --> 2.044 | Grad_l2 --> 0.409 | Weights_l2 --> 58795.684 | Lr --> 0.003 | Seconds_per_step --> 2.103 | [2024-04-24 12:19:25,665][Main][INFO] - [train] Step 109300 out of 120000 | Loss --> 2.041 | Grad_l2 --> 0.386 | Weights_l2 --> 58814.774 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 12:22:55,793][Main][INFO] - [train] Step 109400 out of 120000 | Loss --> 2.043 | Grad_l2 --> 0.400 | Weights_l2 --> 58833.597 | Lr --> 0.003 | Seconds_per_step --> 2.101 | [2024-04-24 12:26:27,605][Main][INFO] - [train] Step 109500 out of 120000 | Loss --> 2.025 | Grad_l2 --> 0.440 | Weights_l2 --> 58852.248 | Lr --> 0.003 | Seconds_per_step --> 2.118 | [2024-04-24 12:30:01,195][Main][INFO] - [train] Step 109600 out of 120000 | Loss --> 2.024 | Grad_l2 --> 0.391 | Weights_l2 --> 58870.410 | Lr --> 0.003 | Seconds_per_step --> 2.136 | [2024-04-24 12:33:32,505][Main][INFO] - [train] Step 109700 out of 120000 | Loss --> 2.018 | Grad_l2 --> 0.397 | Weights_l2 --> 58888.076 | Lr --> 0.003 | Seconds_per_step --> 2.113 | [2024-04-24 12:37:04,409][Main][INFO] - [train] Step 109800 out of 120000 | Loss --> 2.023 | Grad_l2 --> 0.399 | Weights_l2 --> 58905.595 | Lr --> 0.003 | Seconds_per_step --> 2.119 | [2024-04-24 12:40:38,582][Main][INFO] - [train] Step 109900 out of 120000 | Loss --> 2.015 | Grad_l2 --> 0.398 | Weights_l2 --> 58922.873 | Lr --> 0.003 | Seconds_per_step --> 2.142 | [2024-04-24 12:44:09,694][Main][INFO] - [train] Step 110000 out of 120000 | Loss --> 2.037 | Grad_l2 --> 0.397 | Weights_l2 --> 58939.659 | Lr --> 
0.003 | Seconds_per_step --> 2.111 | [2024-04-24 12:44:09,924][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 12:48:36,794][Main][INFO] - [eval] Step 110000 out of 120000 | Loss --> 1.877 | Accuracy --> 0.660 | Time --> 267.098 | [2024-04-24 12:48:36,797][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-110000 [2024-04-24 12:48:36,801][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading [2024-04-24 12:48:40,098][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-110000/model.safetensors [2024-04-24 12:48:40,161][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-110000/optimizer.bin [2024-04-24 12:48:40,162][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-110000/scheduler.bin [2024-04-24 12:48:40,162][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-110000/sampler.bin [2024-04-24 12:48:40,162][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-110000/sampler_1.bin [2024-04-24 12:48:40,164][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-110000/random_states_0.pkl [2024-04-24 12:52:11,558][Main][INFO] - [train] Step 110100 out of 120000 | Loss --> 2.023 | Grad_l2 --> 0.394 | Weights_l2 --> 58956.067 | Lr --> 0.003 | Seconds_per_step --> 2.148 | [2024-04-24 12:55:43,893][Main][INFO] - [train] Step 110200 out of 120000 | Loss --> 2.015 | Grad_l2 --> 0.398 | Weights_l2 --> 58972.429 | Lr --> 0.002 | Seconds_per_step --> 2.123 | [2024-04-24 12:59:16,788][Main][INFO] - [train] Step 110300 out of 120000 | Loss --> 2.023 | Grad_l2 --> 0.398 | Weights_l2 --> 58988.308 | Lr --> 0.002 | Seconds_per_step --> 2.129 | [2024-04-24 
13:02:50,396][Main][INFO] - [train] Step 110400 out of 120000 | Loss --> 2.029 | Grad_l2 --> 0.392 | Weights_l2 --> 59003.687 | Lr --> 0.002 | Seconds_per_step --> 2.136 | [2024-04-24 13:06:21,066][Main][INFO] - [train] Step 110500 out of 120000 | Loss --> 2.023 | Grad_l2 --> 0.397 | Weights_l2 --> 59019.030 | Lr --> 0.002 | Seconds_per_step --> 2.107 | [2024-04-24 13:09:54,566][Main][INFO] - [train] Step 110600 out of 120000 | Loss --> 2.016 | Grad_l2 --> 0.391 | Weights_l2 --> 59033.808 | Lr --> 0.002 | Seconds_per_step --> 2.135 | [2024-04-24 13:13:23,894][Main][INFO] - [train] Step 110700 out of 120000 | Loss --> 2.015 | Grad_l2 --> 0.398 | Weights_l2 --> 59048.019 | Lr --> 0.002 | Seconds_per_step --> 2.093 | [2024-04-24 13:16:55,538][Main][INFO] - [train] Step 110800 out of 120000 | Loss --> 2.004 | Grad_l2 --> 0.393 | Weights_l2 --> 59062.285 | Lr --> 0.002 | Seconds_per_step --> 2.116 | [2024-04-24 13:20:26,208][Main][INFO] - [train] Step 110900 out of 120000 | Loss --> 2.014 | Grad_l2 --> 0.411 | Weights_l2 --> 59076.193 | Lr --> 0.002 | Seconds_per_step --> 2.107 | [2024-04-24 13:23:58,266][Main][INFO] - [train] Step 111000 out of 120000 | Loss --> 2.010 | Grad_l2 --> 0.401 | Weights_l2 --> 59089.913 | Lr --> 0.002 | Seconds_per_step --> 2.121 | [2024-04-24 13:27:29,463][Main][INFO] - [train] Step 111100 out of 120000 | Loss --> 2.007 | Grad_l2 --> 0.398 | Weights_l2 --> 59103.294 | Lr --> 0.002 | Seconds_per_step --> 2.112 | [2024-04-24 13:31:01,238][Main][INFO] - [train] Step 111200 out of 120000 | Loss --> 2.014 | Grad_l2 --> 0.383 | Weights_l2 --> 59116.496 | Lr --> 0.002 | Seconds_per_step --> 2.118 | [2024-04-24 13:34:33,243][Main][INFO] - [train] Step 111300 out of 120000 | Loss --> 1.999 | Grad_l2 --> 0.398 | Weights_l2 --> 59129.301 | Lr --> 0.002 | Seconds_per_step --> 2.120 | [2024-04-24 13:38:04,368][Main][INFO] - [train] Step 111400 out of 120000 | Loss --> 2.010 | Grad_l2 --> 0.416 | Weights_l2 --> 59141.700 | Lr --> 0.002 | Seconds_per_step 
--> 2.111 | [2024-04-24 13:41:37,166][Main][INFO] - [train] Step 111500 out of 120000 | Loss --> 2.007 | Grad_l2 --> 0.397 | Weights_l2 --> 59154.158 | Lr --> 0.002 | Seconds_per_step --> 2.128 | [2024-04-24 13:45:11,269][Main][INFO] - [train] Step 111600 out of 120000 | Loss --> 2.000 | Grad_l2 --> 0.405 | Weights_l2 --> 59166.151 | Lr --> 0.002 | Seconds_per_step --> 2.141 | [2024-04-24 13:48:42,137][Main][INFO] - [train] Step 111700 out of 120000 | Loss --> 2.011 | Grad_l2 --> 0.398 | Weights_l2 --> 59177.601 | Lr --> 0.002 | Seconds_per_step --> 2.109 | [2024-04-24 13:52:14,783][Main][INFO] - [train] Step 111800 out of 120000 | Loss --> 1.997 | Grad_l2 --> 0.395 | Weights_l2 --> 59188.897 | Lr --> 0.002 | Seconds_per_step --> 2.126 | [2024-04-24 13:55:48,797][Main][INFO] - [train] Step 111900 out of 120000 | Loss --> 1.998 | Grad_l2 --> 0.387 | Weights_l2 --> 59200.255 | Lr --> 0.002 | Seconds_per_step --> 2.140 | [2024-04-24 13:59:19,777][Main][INFO] - [train] Step 112000 out of 120000 | Loss --> 1.984 | Grad_l2 --> 0.387 | Weights_l2 --> 59211.161 | Lr --> 0.002 | Seconds_per_step --> 2.110 | [2024-04-24 14:02:48,438][Main][INFO] - [train] Step 112100 out of 120000 | Loss --> 1.994 | Grad_l2 --> 0.394 | Weights_l2 --> 59221.774 | Lr --> 0.002 | Seconds_per_step --> 2.087 | [2024-04-24 14:06:19,742][Main][INFO] - [train] Step 112200 out of 120000 | Loss --> 1.993 | Grad_l2 --> 0.384 | Weights_l2 --> 59232.303 | Lr --> 0.002 | Seconds_per_step --> 2.113 | [2024-04-24 14:09:52,194][Main][INFO] - [train] Step 112300 out of 120000 | Loss --> 2.002 | Grad_l2 --> 0.389 | Weights_l2 --> 59242.423 | Lr --> 0.002 | Seconds_per_step --> 2.124 | [2024-04-24 14:13:20,759][Main][INFO] - [train] Step 112400 out of 120000 | Loss --> 1.994 | Grad_l2 --> 0.390 | Weights_l2 --> 59252.259 | Lr --> 0.002 | Seconds_per_step --> 2.086 | [2024-04-24 14:16:51,877][Main][INFO] - [train] Step 112500 out of 120000 | Loss --> 2.005 | Grad_l2 --> 0.393 | Weights_l2 --> 59261.790 | Lr --> 
0.002 | Seconds_per_step --> 2.111 | [2024-04-24 14:20:27,493][Main][INFO] - [train] Step 112600 out of 120000 | Loss --> 1.995 | Grad_l2 --> 0.388 | Weights_l2 --> 59271.204 | Lr --> 0.002 | Seconds_per_step --> 2.156 | [2024-04-24 14:23:58,738][Main][INFO] - [train] Step 112700 out of 120000 | Loss --> 2.003 | Grad_l2 --> 0.393 | Weights_l2 --> 59280.283 | Lr --> 0.002 | Seconds_per_step --> 2.112 | [2024-04-24 14:27:29,443][Main][INFO] - [train] Step 112800 out of 120000 | Loss --> 1.995 | Grad_l2 --> 0.382 | Weights_l2 --> 59289.036 | Lr --> 0.002 | Seconds_per_step --> 2.107 | [2024-04-24 14:31:00,938][Main][INFO] - [train] Step 112900 out of 120000 | Loss --> 2.013 | Grad_l2 --> 0.389 | Weights_l2 --> 59297.639 | Lr --> 0.002 | Seconds_per_step --> 2.115 | [2024-04-24 14:34:33,642][Main][INFO] - [train] Step 113000 out of 120000 | Loss --> 2.005 | Grad_l2 --> 0.393 | Weights_l2 --> 59305.970 | Lr --> 0.002 | Seconds_per_step --> 2.127 | [2024-04-24 14:38:05,750][Main][INFO] - [train] Step 113100 out of 120000 | Loss --> 1.985 | Grad_l2 --> 0.389 | Weights_l2 --> 59314.225 | Lr --> 0.002 | Seconds_per_step --> 2.121 | [2024-04-24 14:41:38,740][Main][INFO] - [train] Step 113200 out of 120000 | Loss --> 1.993 | Grad_l2 --> 0.381 | Weights_l2 --> 59322.173 | Lr --> 0.002 | Seconds_per_step --> 2.130 | [2024-04-24 14:45:11,899][Main][INFO] - [train] Step 113300 out of 120000 | Loss --> 1.992 | Grad_l2 --> 0.390 | Weights_l2 --> 59329.666 | Lr --> 0.002 | Seconds_per_step --> 2.132 | [2024-04-24 14:48:43,046][Main][INFO] - [train] Step 113400 out of 120000 | Loss --> 2.007 | Grad_l2 --> 0.390 | Weights_l2 --> 59337.106 | Lr --> 0.002 | Seconds_per_step --> 2.111 | [2024-04-24 14:52:11,938][Main][INFO] - [train] Step 113500 out of 120000 | Loss --> 1.995 | Grad_l2 --> 0.388 | Weights_l2 --> 59344.301 | Lr --> 0.002 | Seconds_per_step --> 2.089 | [2024-04-24 14:55:44,637][Main][INFO] - [train] Step 113600 out of 120000 | Loss --> 1.990 | Grad_l2 --> 0.381 | 
Weights_l2 --> 59351.167 | Lr --> 0.002 | Seconds_per_step --> 2.127 | [2024-04-24 14:59:16,845][Main][INFO] - [train] Step 113700 out of 120000 | Loss --> 2.000 | Grad_l2 --> 0.385 | Weights_l2 --> 59357.866 | Lr --> 0.002 | Seconds_per_step --> 2.122 | [2024-04-24 15:02:46,638][Main][INFO] - [train] Step 113800 out of 120000 | Loss --> 1.989 | Grad_l2 --> 0.387 | Weights_l2 --> 59364.343 | Lr --> 0.002 | Seconds_per_step --> 2.098 | [2024-04-24 15:06:19,215][Main][INFO] - [train] Step 113900 out of 120000 | Loss --> 2.002 | Grad_l2 --> 0.385 | Weights_l2 --> 59370.618 | Lr --> 0.002 | Seconds_per_step --> 2.126 | [2024-04-24 15:09:52,275][Main][INFO] - [train] Step 114000 out of 120000 | Loss --> 2.010 | Grad_l2 --> 0.389 | Weights_l2 --> 59376.756 | Lr --> 0.002 | Seconds_per_step --> 2.131 | [2024-04-24 15:13:22,410][Main][INFO] - [train] Step 114100 out of 120000 | Loss --> 1.994 | Grad_l2 --> 0.392 | Weights_l2 --> 59382.520 | Lr --> 0.001 | Seconds_per_step --> 2.101 | [2024-04-24 15:16:54,989][Main][INFO] - [train] Step 114200 out of 120000 | Loss --> 1.997 | Grad_l2 --> 0.387 | Weights_l2 --> 59388.180 | Lr --> 0.001 | Seconds_per_step --> 2.126 | [2024-04-24 15:20:31,988][Main][INFO] - [train] Step 114300 out of 120000 | Loss --> 2.010 | Grad_l2 --> 0.390 | Weights_l2 --> 59393.524 | Lr --> 0.001 | Seconds_per_step --> 2.170 | [2024-04-24 15:24:00,099][Main][INFO] - [train] Step 114400 out of 120000 | Loss --> 1.970 | Grad_l2 --> 0.380 | Weights_l2 --> 59398.755 | Lr --> 0.001 | Seconds_per_step --> 2.081 | [2024-04-24 15:27:32,847][Main][INFO] - [train] Step 114500 out of 120000 | Loss --> 1.981 | Grad_l2 --> 0.382 | Weights_l2 --> 59403.568 | Lr --> 0.001 | Seconds_per_step --> 2.127 | [2024-04-24 15:31:01,741][Main][INFO] - [train] Step 114600 out of 120000 | Loss --> 1.996 | Grad_l2 --> 0.379 | Weights_l2 --> 59408.368 | Lr --> 0.001 | Seconds_per_step --> 2.089 | [2024-04-24 15:34:32,472][Main][INFO] - [train] Step 114700 out of 120000 | Loss --> 
1.988 | Grad_l2 --> 0.387 | Weights_l2 --> 59413.187 | Lr --> 0.001 | Seconds_per_step --> 2.107 | [2024-04-24 15:38:03,352][Main][INFO] - [train] Step 114800 out of 120000 | Loss --> 1.995 | Grad_l2 --> 0.384 | Weights_l2 --> 59417.553 | Lr --> 0.001 | Seconds_per_step --> 2.109 | [2024-04-24 15:41:35,396][Main][INFO] - [train] Step 114900 out of 120000 | Loss --> 1.993 | Grad_l2 --> 0.392 | Weights_l2 --> 59421.937 | Lr --> 0.001 | Seconds_per_step --> 2.120 | [2024-04-24 15:45:09,952][Main][INFO] - [train] Step 115000 out of 120000 | Loss --> 2.009 | Grad_l2 --> 0.382 | Weights_l2 --> 59426.219 | Lr --> 0.001 | Seconds_per_step --> 2.146 | [2024-04-24 15:45:10,151][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers. [2024-04-24 15:49:32,696][Main][INFO] - [eval] Step 115000 out of 120000 | Loss --> 1.828 | Accuracy --> 0.668 | Time --> 262.742 | [2024-04-24 15:53:06,092][Main][INFO] - [train] Step 115100 out of 120000 | Loss --> 1.988 | Grad_l2 --> 0.378 | Weights_l2 --> 59430.231 | Lr --> 0.001 | Seconds_per_step --> 2.134 | [2024-04-24 15:56:38,338][Main][INFO] - [train] Step 115200 out of 120000 | Loss --> 1.979 | Grad_l2 --> 0.378 | Weights_l2 --> 59434.167 | Lr --> 0.001 | Seconds_per_step --> 2.122 | [2024-04-24 16:00:08,997][Main][INFO] - [train] Step 115300 out of 120000 | Loss --> 1.979 | Grad_l2 --> 0.379 | Weights_l2 --> 59437.792 | Lr --> 0.001 | Seconds_per_step --> 2.107 | [2024-04-24 16:03:39,847][Main][INFO] - [train] Step 115400 out of 120000 | Loss --> 1.981 | Grad_l2 --> 0.379 | Weights_l2 --> 59441.456 | Lr --> 0.001 | Seconds_per_step --> 2.108 | [2024-04-24 16:07:12,071][Main][INFO] - [train] Step 115500 out of 120000 | Loss --> 1.978 | Grad_l2 --> 0.369 | Weights_l2 --> 59444.765 | Lr --> 0.001 | Seconds_per_step --> 2.122 | [2024-04-24 16:10:40,368][Main][INFO] - [train] Step 115600 out of 120000 | Loss --> 1.983 | Grad_l2 --> 0.380 | Weights_l2 --> 59447.883 | 
Lr --> 0.001 | Seconds_per_step --> 2.083 | [2024-04-24 16:14:11,906][Main][INFO] - [train] Step 115700 out of 120000 | Loss --> 1.967 | Grad_l2 --> 0.376 | Weights_l2 --> 59450.891 | Lr --> 0.001 | Seconds_per_step --> 2.115 | [2024-04-24 16:17:42,777][Main][INFO] - [train] Step 115800 out of 120000 | Loss --> 1.972 | Grad_l2 --> 0.376 | Weights_l2 --> 59453.813 | Lr --> 0.001 | Seconds_per_step --> 2.109 | [2024-04-24 16:21:15,637][Main][INFO] - [train] Step 115900 out of 120000 | Loss --> 1.978 | Grad_l2 --> 0.379 | Weights_l2 --> 59456.701 | Lr --> 0.001 | Seconds_per_step --> 2.129 | [2024-04-24 16:24:46,537][Main][INFO] - [train] Step 116000 out of 120000 | Loss --> 1.972 | Grad_l2 --> 0.385 | Weights_l2 --> 59459.487 | Lr --> 0.001 | Seconds_per_step --> 2.109 | [2024-04-24 16:28:18,784][Main][INFO] - [train] Step 116100 out of 120000 | Loss --> 1.982 | Grad_l2 --> 0.377 | Weights_l2 --> 59461.979 | Lr --> 0.001 | Seconds_per_step --> 2.122 | [2024-04-24 16:31:50,538][Main][INFO] - [train] Step 116200 out of 120000 | Loss --> 1.980 | Grad_l2 --> 0.374 | Weights_l2 --> 59464.502 | Lr --> 0.001 | Seconds_per_step --> 2.118 | [2024-04-24 16:35:20,187][Main][INFO] - [train] Step 116300 out of 120000 | Loss --> 1.984 | Grad_l2 --> 0.372 | Weights_l2 --> 59466.701 | Lr --> 0.001 | Seconds_per_step --> 2.096 | [2024-04-24 16:38:53,963][Main][INFO] - [train] Step 116400 out of 120000 | Loss --> 1.981 | Grad_l2 --> 0.383 | Weights_l2 --> 59468.777 | Lr --> 0.001 | Seconds_per_step --> 2.138 | [2024-04-24 16:42:25,198][Main][INFO] - [train] Step 116500 out of 120000 | Loss --> 1.976 | Grad_l2 --> 0.370 | Weights_l2 --> 59470.810 | Lr --> 0.001 | Seconds_per_step --> 2.112 | [2024-04-24 16:45:56,168][Main][INFO] - [train] Step 116600 out of 120000 | Loss --> 1.958 | Grad_l2 --> 0.373 | Weights_l2 --> 59472.728 | Lr --> 0.001 | Seconds_per_step --> 2.110 | [2024-04-24 16:49:31,938][Main][INFO] - [train] Step 116700 out of 120000 | Loss --> 1.966 | Grad_l2 --> 0.381 | 
Weights_l2 --> 59474.472 | Lr --> 0.001 | Seconds_per_step --> 2.158 | [2024-04-24 16:53:00,738][Main][INFO] - [train] Step 116800 out of 120000 | Loss --> 1.958 | Grad_l2 --> 0.378 | Weights_l2 --> 59476.100 | Lr --> 0.001 | Seconds_per_step --> 2.088 | [2024-04-24 16:56:35,361][Main][INFO] - [train] Step 116900 out of 120000 | Loss --> 1.962 | Grad_l2 --> 0.376 | Weights_l2 --> 59477.835 | Lr --> 0.001 | Seconds_per_step --> 2.146 | [2024-04-24 17:00:04,766][Main][INFO] - [train] Step 117000 out of 120000 | Loss --> 1.962 | Grad_l2 --> 0.389 | Weights_l2 --> 59479.310 | Lr --> 0.001 | Seconds_per_step --> 2.094 | [2024-04-24 17:03:34,638][Main][INFO] - [train] Step 117100 out of 120000 | Loss --> 1.965 | Grad_l2 --> 0.375 | Weights_l2 --> 59480.671 | Lr --> 0.001 | Seconds_per_step --> 2.099 | [2024-04-24 17:07:06,667][Main][INFO] - [train] Step 117200 out of 120000 | Loss --> 1.954 | Grad_l2 --> 0.380 | Weights_l2 --> 59481.976 | Lr --> 0.001 | Seconds_per_step --> 2.120 | [2024-04-24 17:10:38,666][Main][INFO] - [train] Step 117300 out of 120000 | Loss --> 1.957 | Grad_l2 --> 0.374 | Weights_l2 --> 59483.186 | Lr --> 0.001 | Seconds_per_step --> 2.120 | [2024-04-24 17:14:08,939][Main][INFO] - [train] Step 117400 out of 120000 | Loss --> 1.959 | Grad_l2 --> 0.373 | Weights_l2 --> 59484.299 | Lr --> 0.001 | Seconds_per_step --> 2.103 | [2024-04-24 17:17:40,895][Main][INFO] - [train] Step 117500 out of 120000 | Loss --> 1.945 | Grad_l2 --> 0.369 | Weights_l2 --> 59485.339 | Lr --> 0.001 | Seconds_per_step --> 2.120 | [2024-04-24 17:21:13,546][Main][INFO] - [train] Step 117600 out of 120000 | Loss --> 1.955 | Grad_l2 --> 0.377 | Weights_l2 --> 59486.272 | Lr --> 0.001 | Seconds_per_step --> 2.126 | [2024-04-24 17:24:45,583][Main][INFO] - [train] Step 117700 out of 120000 | Loss --> 1.968 | Grad_l2 --> 0.389 | Weights_l2 --> 59487.141 | Lr --> 0.001 | Seconds_per_step --> 2.120 | [2024-04-24 17:28:18,552][Main][INFO] - [train] Step 117800 out of 120000 | Loss --> 
1.979 | Grad_l2 --> 0.374 | Weights_l2 --> 59487.933 | Lr --> 0.001 | Seconds_per_step --> 2.130 |
[2024-04-24 17:31:50,796][Main][INFO] - [train] Step 117900 out of 120000 | Loss --> 1.980 | Grad_l2 --> 0.374 | Weights_l2 --> 59488.663 | Lr --> 0.001 | Seconds_per_step --> 2.122 |
[2024-04-24 17:35:20,476][Main][INFO] - [train] Step 118000 out of 120000 | Loss --> 1.977 | Grad_l2 --> 0.379 | Weights_l2 --> 59489.285 | Lr --> 0.001 | Seconds_per_step --> 2.097 |
[2024-04-24 17:38:53,074][Main][INFO] - [train] Step 118100 out of 120000 | Loss --> 1.973 | Grad_l2 --> 0.367 | Weights_l2 --> 59489.870 | Lr --> 0.000 | Seconds_per_step --> 2.126 |
[2024-04-24 17:42:25,395][Main][INFO] - [train] Step 118200 out of 120000 | Loss --> 1.969 | Grad_l2 --> 0.378 | Weights_l2 --> 59490.389 | Lr --> 0.000 | Seconds_per_step --> 2.123 |
[2024-04-24 17:45:58,869][Main][INFO] - [train] Step 118300 out of 120000 | Loss --> 1.974 | Grad_l2 --> 0.370 | Weights_l2 --> 59490.827 | Lr --> 0.000 | Seconds_per_step --> 2.135 |
[2024-04-24 17:49:34,151][Main][INFO] - [train] Step 118400 out of 120000 | Loss --> 1.976 | Grad_l2 --> 0.372 | Weights_l2 --> 59491.222 | Lr --> 0.000 | Seconds_per_step --> 2.153 |
[2024-04-24 17:53:04,710][Main][INFO] - [train] Step 118500 out of 120000 | Loss --> 1.959 | Grad_l2 --> 0.375 | Weights_l2 --> 59491.592 | Lr --> 0.000 | Seconds_per_step --> 2.106 |
[2024-04-24 17:56:40,776][Main][INFO] - [train] Step 118600 out of 120000 | Loss --> 1.980 | Grad_l2 --> 0.370 | Weights_l2 --> 59491.911 | Lr --> 0.000 | Seconds_per_step --> 2.161 |
[2024-04-24 18:00:09,539][Main][INFO] - [train] Step 118700 out of 120000 | Loss --> 1.965 | Grad_l2 --> 0.366 | Weights_l2 --> 59492.154 | Lr --> 0.000 | Seconds_per_step --> 2.088 |
[2024-04-24 18:03:42,568][Main][INFO] - [train] Step 118800 out of 120000 | Loss --> 1.960 | Grad_l2 --> 0.371 | Weights_l2 --> 59492.425 | Lr --> 0.000 | Seconds_per_step --> 2.130 |
[2024-04-24 18:07:13,137][Main][INFO] - [train] Step 118900 out of 120000 | Loss --> 1.955 | Grad_l2 --> 0.365 | Weights_l2 --> 59492.572 | Lr --> 0.000 | Seconds_per_step --> 2.106 |
[2024-04-24 18:10:46,670][Main][INFO] - [train] Step 119000 out of 120000 | Loss --> 1.953 | Grad_l2 --> 0.366 | Weights_l2 --> 59492.721 | Lr --> 0.000 | Seconds_per_step --> 2.135 |
[2024-04-24 18:14:15,771][Main][INFO] - [train] Step 119100 out of 120000 | Loss --> 1.973 | Grad_l2 --> 0.367 | Weights_l2 --> 59492.836 | Lr --> 0.000 | Seconds_per_step --> 2.091 |
[2024-04-24 18:17:50,209][Main][INFO] - [train] Step 119200 out of 120000 | Loss --> 1.988 | Grad_l2 --> 0.369 | Weights_l2 --> 59492.888 | Lr --> 0.000 | Seconds_per_step --> 2.144 |
[2024-04-24 18:21:24,579][Main][INFO] - [train] Step 119300 out of 120000 | Loss --> 1.979 | Grad_l2 --> 0.374 | Weights_l2 --> 59492.982 | Lr --> 0.000 | Seconds_per_step --> 2.144 |
[2024-04-24 18:24:55,406][Main][INFO] - [train] Step 119400 out of 120000 | Loss --> 1.961 | Grad_l2 --> 0.369 | Weights_l2 --> 59493.009 | Lr --> 0.000 | Seconds_per_step --> 2.108 |
[2024-04-24 18:28:26,840][Main][INFO] - [train] Step 119500 out of 120000 | Loss --> 1.961 | Grad_l2 --> 0.375 | Weights_l2 --> 59493.029 | Lr --> 0.000 | Seconds_per_step --> 2.114 |
[2024-04-24 18:31:57,739][Main][INFO] - [train] Step 119600 out of 120000 | Loss --> 1.968 | Grad_l2 --> 0.366 | Weights_l2 --> 59493.065 | Lr --> 0.000 | Seconds_per_step --> 2.109 |
[2024-04-24 18:35:32,023][Main][INFO] - [train] Step 119700 out of 120000 | Loss --> 1.970 | Grad_l2 --> 0.358 | Weights_l2 --> 59493.067 | Lr --> 0.000 | Seconds_per_step --> 2.143 |
[2024-04-24 18:39:00,704][Main][INFO] - [train] Step 119800 out of 120000 | Loss --> 1.949 | Grad_l2 --> 0.372 | Weights_l2 --> 59493.070 | Lr --> 0.000 | Seconds_per_step --> 2.087 |
[2024-04-24 18:42:34,346][Main][INFO] - [train] Step 119900 out of 120000 | Loss --> 1.933 | Grad_l2 --> 0.369 | Weights_l2 --> 59493.070 | Lr --> 0.000 | Seconds_per_step --> 2.136 |
[2024-04-24 18:46:05,481][Main][INFO] - [train] Step 120000 out of 120000 | Loss --> 1.952 | Grad_l2 --> 0.368 | Weights_l2 --> 59493.065 | Lr --> 0.000 | Seconds_per_step --> 2.111 |
[2024-04-24 18:46:05,884][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers.
[2024-04-24 18:50:29,886][Main][INFO] - [eval] Step 120000 out of 120000 | Loss --> 1.797 | Accuracy --> 0.672 | Time --> 264.402 |
[2024-04-24 18:50:29,891][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-120000
[2024-04-24 18:50:29,895][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
[2024-04-24 18:50:33,377][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-120000/model.safetensors
[2024-04-24 18:50:33,466][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-120000/optimizer.bin
[2024-04-24 18:50:33,468][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-120000/scheduler.bin
[2024-04-24 18:50:33,468][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-120000/sampler.bin
[2024-04-24 18:50:33,468][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-120000/sampler_1.bin
[2024-04-24 18:50:33,478][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-120000/random_states_0.pkl
[2024-04-24 18:50:34,797][datasets.iterable_dataset][WARNING] - Too many dataloader workers: 8 (max is dataset.n_shards=1). Stopping 7 dataloader workers.
[2024-04-24 18:54:56,495][Main][INFO] - [eval] Step 120001 out of 120000 | Loss --> 1.799 | Accuracy --> 0.672 | Time --> 261.910 |
[2024-04-24 18:54:56,498][accelerate.accelerator][INFO] - Saving current state to checkpoint-pt-120001
[2024-04-24 18:54:56,501][accelerate.utils.other][WARNING] - Removed shared tensor {'decoder.embed_tokens.weight', 'encoder.embed_tokens.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
[2024-04-24 18:54:59,634][accelerate.checkpointing][INFO] - Model weights saved in checkpoint-pt-120001/model.safetensors
[2024-04-24 18:54:59,688][accelerate.checkpointing][INFO] - Optimizer state saved in checkpoint-pt-120001/optimizer.bin
[2024-04-24 18:54:59,689][accelerate.checkpointing][INFO] - Scheduler state saved in checkpoint-pt-120001/scheduler.bin
[2024-04-24 18:54:59,689][accelerate.checkpointing][INFO] - Sampler state for dataloader 0 saved in checkpoint-pt-120001/sampler.bin
[2024-04-24 18:54:59,689][accelerate.checkpointing][INFO] - Sampler state for dataloader 1 saved in checkpoint-pt-120001/sampler_1.bin
[2024-04-24 18:54:59,690][accelerate.checkpointing][INFO] - Random states saved in checkpoint-pt-120001/random_states_0.pkl